* [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition
@ 2018-06-20 10:18 Xunlei Pang
  2018-06-20 10:18 ` [PATCH v2 2/2] sched/fair: Advance global expiration when period timer is restarted Xunlei Pang
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Xunlei Pang @ 2018-06-20 10:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Ben Segall; +Cc: linux-kernel

I noticed the group constantly got throttled even though it consumed
little CPU, which caused jitter in the response time of some of our
business containers that enable CPU quota.

It's very simple to reproduce:
  mkdir /sys/fs/cgroup/cpu/test
  cd /sys/fs/cgroup/cpu/test
  echo 100000 > cpu.cfs_quota_us
  echo $$ > tasks
then repeat:
  cat cpu.stat | grep nr_throttled  # nr_throttled will increase

After some analysis, we found that cfs_rq::runtime_remaining will
be cleared by expire_cfs_rq_runtime() due to two equal but stale
"cfs_{b|q}->runtime_expires" values after the period timer is re-armed.

The current condition used to judge clock drift in expire_cfs_rq_runtime()
is wrong: the two runtime_expires values are actually the same when clock
drift happens, so this condition can never hit. The original design was
done correctly by commit a9cf55b28610 ("sched: Expire invalid runtime"),
but was changed to the current one because of a locking issue.

This patch introduces another way: it adds a new field to both struct
cfs_rq and struct cfs_bandwidth to record the expiration update sequence,
and uses them to figure out whether clock drift has happened (true if
they are equal).
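The essence of the fix is the classic sequence-count pattern. A minimal
user-space sketch of the idea (the struct and function names below are
simplified stand-ins for cfs_bandwidth/cfs_rq, not the kernel code):

```c
/*
 * Illustrative sketch of the expires_seq scheme (NOT kernel code):
 * simplified stand-ins for struct cfs_bandwidth and struct cfs_rq.
 */
struct bandwidth {
	unsigned long long runtime_expires;	/* global deadline */
	int expires_seq;			/* bumped on every real refill */
};

struct runqueue {
	unsigned long long runtime_expires;	/* local copy of the deadline */
	int expires_seq;			/* snapshot of the global seq */
};

/* A genuine refill advances the deadline and the sequence together. */
static void refill(struct bandwidth *b, unsigned long long now,
		   unsigned long long period_ns)
{
	b->runtime_expires = now + period_ns;
	b->expires_seq++;
}

/*
 * If the sequence numbers still match, the global deadline has not
 * advanced since the local snapshot was taken, so a seemingly expired
 * local deadline can only be clock drift between CPUs.
 */
static int is_clock_drift(const struct runqueue *rq, const struct bandwidth *b)
{
	return rq->expires_seq == b->expires_seq;
}
```

Comparing sequence numbers sidesteps the broken u64 comparison: after the
period timer is re-armed the two runtime_expires values can be equal yet
stale, while the sequence counter always tells the two cases apart.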

Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
Cc: Ben Segall <bsegall@google.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
 kernel/sched/fair.c  | 14 ++++++++------
 kernel/sched/sched.h |  6 ++++--
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05aab7f..e6bb68d52962 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 	u64 amount = 0, min_amount, expires;
+	int expires_seq;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4628,6 +4630,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
+	expires_seq = cfs_b->expires_seq;
 	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	 * spread between our sched_clock and the one on which runtime was
 	 * issued.
 	 */
-	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
+	if (cfs_rq->expires_seq != expires_seq) {
+		cfs_rq->expires_seq = expires_seq;
 		cfs_rq->runtime_expires = expires;
+	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
@@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	 * has not truly expired.
 	 *
 	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline has advanced. It is valid to compare
-	 * cfs_b->runtime_expires without any locks since we only care about
-	 * exact equality, so a partial write will still work.
+	 * whether the global deadline(cfs_b->expires_seq) has advanced.
 	 */
-
-	if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
+	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
 		/* extend local deadline, drift is bounded above by 2 ticks */
 		cfs_rq->runtime_expires += TICK_NSEC;
 	} else {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6601baf2361c..e977e04f8daf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -334,9 +334,10 @@ struct cfs_bandwidth {
 	u64			runtime;
 	s64			hierarchical_quota;
 	u64			runtime_expires;
+	int			expires_seq;
 
-	int			idle;
-	int			period_active;
+	short			idle;
+	short			period_active;
 	struct hrtimer		period_timer;
 	struct hrtimer		slack_timer;
 	struct list_head	throttled_cfs_rq;
@@ -551,6 +552,7 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
+	int			expires_seq;
 	u64			runtime_expires;
 	s64			runtime_remaining;
 
-- 
2.14.1.40.g8e62ba1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 2/2] sched/fair: Advance global expiration when period timer is restarted
  2018-06-20 10:18 [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition Xunlei Pang
@ 2018-06-20 10:18 ` Xunlei Pang
  2018-06-20 17:02   ` bsegall
  2018-07-03  7:50   ` [tip:sched/core] " tip-bot for Xunlei Pang
  2018-06-20 17:01 ` [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition bsegall
  2018-07-03  7:50 ` [tip:sched/core] " tip-bot for Xunlei Pang
  2 siblings, 2 replies; 9+ messages in thread
From: Xunlei Pang @ 2018-06-20 10:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Ben Segall; +Cc: linux-kernel

When the period gets restarted after some idle time, start_cfs_bandwidth()
doesn't update the expiration information, so expire_cfs_rq_runtime() will
see cfs_rq->runtime_expires smaller than the rq clock and fall into the
clock drift logic, wasting needless CPU cycles on the scheduler hot path.

Update the global expiration in start_cfs_bandwidth() to avoid frequent
expire_cfs_rq_runtime() calls once a new period begins.
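The arithmetic behind the update is simple: hrtimer_forward_now() returns
the number of whole periods the timer overran while idle, so the global
deadline must jump past all of them plus the freshly started period. A
stand-alone sketch (the function name is illustrative, not kernel code):

```c
/*
 * Sketch of the deadline advance done in start_cfs_bandwidth():
 * "overrun" is the number of whole periods missed while the timer was
 * idle (as reported by hrtimer_forward_now()), and the deadline is
 * moved past them all plus the period that is just starting.
 * The function name is illustrative; this is not the kernel code.
 */
static unsigned long long
advance_runtime_expires(unsigned long long expires,
			unsigned long long overrun,
			unsigned long long period_ns)
{
	return expires + (overrun + 1) * period_ns;
}
```

With the deadline kept ahead of the rq clock, expire_cfs_rq_runtime() no
longer takes the drift path on every call after an idle gap.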

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
 kernel/sched/fair.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e6bb68d52962..f167aca066cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
+	u64 overrun;
+
 	lockdep_assert_held(&cfs_b->lock);
 
-	if (!cfs_b->period_active) {
-		cfs_b->period_active = 1;
-		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
-	}
+	if (cfs_b->period_active)
+		return;
+
+	cfs_b->period_active = 1;
+	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
+	cfs_b->expires_seq++;
+	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-- 
2.14.1.40.g8e62ba1



* Re: [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition
  2018-06-20 10:18 [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition Xunlei Pang
  2018-06-20 10:18 ` [PATCH v2 2/2] sched/fair: Advance global expiration when period timer is restarted Xunlei Pang
@ 2018-06-20 17:01 ` bsegall
  2018-06-21  3:56   ` Xunlei Pang
  2018-07-03  7:50 ` [tip:sched/core] " tip-bot for Xunlei Pang
  2 siblings, 1 reply; 9+ messages in thread
From: bsegall @ 2018-06-20 17:01 UTC (permalink / raw)
  To: Xunlei Pang; +Cc: Peter Zijlstra, Ingo Molnar, Ben Segall, linux-kernel

Xunlei Pang <xlpang@linux.alibaba.com> writes:

> I noticed the group constantly got throttled even though it consumed
> little CPU, which caused jitter in the response time of some of our
> business containers that enable CPU quota.
>
> It's very simple to reproduce:
>   mkdir /sys/fs/cgroup/cpu/test
>   cd /sys/fs/cgroup/cpu/test
>   echo 100000 > cpu.cfs_quota_us
>   echo $$ > tasks
> then repeat:
>   cat cpu.stat | grep nr_throttled  # nr_throttled will increase
>
> After some analysis, we found that cfs_rq::runtime_remaining will
> be cleared by expire_cfs_rq_runtime() due to two equal but stale
> "cfs_{b|q}->runtime_expires" values after the period timer is re-armed.
>
> The current condition used to judge clock drift in expire_cfs_rq_runtime()
> is wrong: the two runtime_expires values are actually the same when clock
> drift happens, so this condition can never hit. The original design was
> done correctly by commit a9cf55b28610 ("sched: Expire invalid runtime"),
> but was changed to the current one because of a locking issue.
>
> This patch introduces another way: it adds a new field to both struct
> cfs_rq and struct cfs_bandwidth to record the expiration update sequence,
> and uses them to figure out whether clock drift has happened (true if
> they are equal).
>
> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
> Cc: Ben Segall <bsegall@google.com>

Reviewed-By: Ben Segall <bsegall@google.com>

> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
> ---
>  kernel/sched/fair.c  | 14 ++++++++------
>  kernel/sched/sched.h |  6 ++++--
>  2 files changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e497c05aab7f..e6bb68d52962 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>  	now = sched_clock_cpu(smp_processor_id());
>  	cfs_b->runtime = cfs_b->quota;
>  	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> +	cfs_b->expires_seq++;
>  }
>  
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  	struct task_group *tg = cfs_rq->tg;
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>  	u64 amount = 0, min_amount, expires;
> +	int expires_seq;
>  
>  	/* note: this is a positive sum as runtime_remaining <= 0 */
>  	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> @@ -4628,6 +4630,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  			cfs_b->idle = 0;
>  		}
>  	}
> +	expires_seq = cfs_b->expires_seq;
>  	expires = cfs_b->runtime_expires;
>  	raw_spin_unlock(&cfs_b->lock);
>  
> @@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  	 * spread between our sched_clock and the one on which runtime was
>  	 * issued.
>  	 */
> -	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
> +	if (cfs_rq->expires_seq != expires_seq) {
> +		cfs_rq->expires_seq = expires_seq;
>  		cfs_rq->runtime_expires = expires;
> +	}
>  
>  	return cfs_rq->runtime_remaining > 0;
>  }
> @@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  	 * has not truly expired.
>  	 *
>  	 * Fortunately we can check determine whether this the case by checking
> -	 * whether the global deadline has advanced. It is valid to compare
> -	 * cfs_b->runtime_expires without any locks since we only care about
> -	 * exact equality, so a partial write will still work.
> +	 * whether the global deadline(cfs_b->expires_seq) has advanced.
>  	 */
> -
> -	if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> +	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
>  		/* extend local deadline, drift is bounded above by 2 ticks */
>  		cfs_rq->runtime_expires += TICK_NSEC;
>  	} else {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6601baf2361c..e977e04f8daf 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -334,9 +334,10 @@ struct cfs_bandwidth {
>  	u64			runtime;
>  	s64			hierarchical_quota;
>  	u64			runtime_expires;
> +	int			expires_seq;
>  
> -	int			idle;
> -	int			period_active;
> +	short			idle;
> +	short			period_active;
>  	struct hrtimer		period_timer;
>  	struct hrtimer		slack_timer;
>  	struct list_head	throttled_cfs_rq;
> @@ -551,6 +552,7 @@ struct cfs_rq {
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
>  	int			runtime_enabled;
> +	int			expires_seq;
>  	u64			runtime_expires;
>  	s64			runtime_remaining;


* Re: [PATCH v2 2/2] sched/fair: Advance global expiration when period timer is restarted
  2018-06-20 10:18 ` [PATCH v2 2/2] sched/fair: Advance global expiration when period timer is restarted Xunlei Pang
@ 2018-06-20 17:02   ` bsegall
  2018-07-03  7:50   ` [tip:sched/core] " tip-bot for Xunlei Pang
  1 sibling, 0 replies; 9+ messages in thread
From: bsegall @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Xunlei Pang; +Cc: Peter Zijlstra, Ingo Molnar, Ben Segall, linux-kernel

Xunlei Pang <xlpang@linux.alibaba.com> writes:

> When the period gets restarted after some idle time, start_cfs_bandwidth()
> doesn't update the expiration information, so expire_cfs_rq_runtime() will
> see cfs_rq->runtime_expires smaller than the rq clock and fall into the
> clock drift logic, wasting needless CPU cycles on the scheduler hot path.
>
> Update the global expiration in start_cfs_bandwidth() to avoid frequent
> expire_cfs_rq_runtime() calls once a new period begins.
>
> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-By: Ben Segall <bsegall@google.com>


> ---
>  kernel/sched/fair.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e6bb68d52962..f167aca066cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  
>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  {
> +	u64 overrun;
> +
>  	lockdep_assert_held(&cfs_b->lock);
>  
> -	if (!cfs_b->period_active) {
> -		cfs_b->period_active = 1;
> -		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> -		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> -	}
> +	if (cfs_b->period_active)
> +		return;
> +
> +	cfs_b->period_active = 1;
> +	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> +	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> +	cfs_b->expires_seq++;
> +	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>  }
>  
>  static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)


* Re: [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition
  2018-06-20 17:01 ` [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition bsegall
@ 2018-06-21  3:56   ` Xunlei Pang
  2018-06-21  8:08     ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Xunlei Pang @ 2018-06-21  3:56 UTC (permalink / raw)
  To: bsegall; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, stable

On 6/21/18 1:01 AM, bsegall@google.com wrote:
> Xunlei Pang <xlpang@linux.alibaba.com> writes:
> 
>> I noticed the group constantly got throttled even though it consumed
>> little CPU, which caused jitter in the response time of some of our
>> business containers that enable CPU quota.
>>
>> It's very simple to reproduce:
>>   mkdir /sys/fs/cgroup/cpu/test
>>   cd /sys/fs/cgroup/cpu/test
>>   echo 100000 > cpu.cfs_quota_us
>>   echo $$ > tasks
>> then repeat:
>>   cat cpu.stat | grep nr_throttled  # nr_throttled will increase
>>
>> After some analysis, we found that cfs_rq::runtime_remaining will
>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>> "cfs_{b|q}->runtime_expires" values after the period timer is re-armed.
>>
>> The current condition used to judge clock drift in expire_cfs_rq_runtime()
>> is wrong: the two runtime_expires values are actually the same when clock
>> drift happens, so this condition can never hit. The original design was
>> done correctly by commit a9cf55b28610 ("sched: Expire invalid runtime"),
>> but was changed to the current one because of a locking issue.
>>
>> This patch introduces another way: it adds a new field to both struct
>> cfs_rq and struct cfs_bandwidth to record the expiration update sequence,
>> and uses them to figure out whether clock drift has happened (true if
>> they are equal).
>>
>> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
>> Cc: Ben Segall <bsegall@google.com>
> 
> Reviewed-By: Ben Segall <bsegall@google.com>

Thanks Ben :-)

Hi Peter, could you please have a look at them?

Cc stable@vger.kernel.org

> 
>> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
>> ---
>>  kernel/sched/fair.c  | 14 ++++++++------
>>  kernel/sched/sched.h |  6 ++++--
>>  2 files changed, 12 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e497c05aab7f..e6bb68d52962 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>>  	now = sched_clock_cpu(smp_processor_id());
>>  	cfs_b->runtime = cfs_b->quota;
>>  	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
>> +	cfs_b->expires_seq++;
>>  }
>>  
>>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> @@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  	struct task_group *tg = cfs_rq->tg;
>>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>>  	u64 amount = 0, min_amount, expires;
>> +	int expires_seq;
>>  
>>  	/* note: this is a positive sum as runtime_remaining <= 0 */
>>  	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
>> @@ -4628,6 +4630,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  			cfs_b->idle = 0;
>>  		}
>>  	}
>> +	expires_seq = cfs_b->expires_seq;
>>  	expires = cfs_b->runtime_expires;
>>  	raw_spin_unlock(&cfs_b->lock);
>>  
>> @@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  	 * spread between our sched_clock and the one on which runtime was
>>  	 * issued.
>>  	 */
>> -	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
>> +	if (cfs_rq->expires_seq != expires_seq) {
>> +		cfs_rq->expires_seq = expires_seq;
>>  		cfs_rq->runtime_expires = expires;
>> +	}
>>  
>>  	return cfs_rq->runtime_remaining > 0;
>>  }
>> @@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>  	 * has not truly expired.
>>  	 *
>>  	 * Fortunately we can check determine whether this the case by checking
>> -	 * whether the global deadline has advanced. It is valid to compare
>> -	 * cfs_b->runtime_expires without any locks since we only care about
>> -	 * exact equality, so a partial write will still work.
>> +	 * whether the global deadline(cfs_b->expires_seq) has advanced.
>>  	 */
>> -
>> -	if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
>> +	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
>>  		/* extend local deadline, drift is bounded above by 2 ticks */
>>  		cfs_rq->runtime_expires += TICK_NSEC;
>>  	} else {
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 6601baf2361c..e977e04f8daf 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -334,9 +334,10 @@ struct cfs_bandwidth {
>>  	u64			runtime;
>>  	s64			hierarchical_quota;
>>  	u64			runtime_expires;
>> +	int			expires_seq;
>>  
>> -	int			idle;
>> -	int			period_active;
>> +	short			idle;
>> +	short			period_active;
>>  	struct hrtimer		period_timer;
>>  	struct hrtimer		slack_timer;
>>  	struct list_head	throttled_cfs_rq;
>> @@ -551,6 +552,7 @@ struct cfs_rq {
>>  
>>  #ifdef CONFIG_CFS_BANDWIDTH
>>  	int			runtime_enabled;
>> +	int			expires_seq;
>>  	u64			runtime_expires;
>>  	s64			runtime_remaining;


* Re: [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition
  2018-06-21  3:56   ` Xunlei Pang
@ 2018-06-21  8:08     ` Peter Zijlstra
  2018-06-21 10:51       ` Xunlei Pang
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-06-21  8:08 UTC (permalink / raw)
  To: Xunlei Pang; +Cc: bsegall, Ingo Molnar, linux-kernel, stable

On Thu, Jun 21, 2018 at 11:56:56AM +0800, Xunlei Pang wrote:
> >> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
> >> Cc: Ben Segall <bsegall@google.com>
> > 
> > Reviewed-By: Ben Segall <bsegall@google.com>
> 
> Thanks Ben :-)
> 
> Hi Peter, could you please have a look at them?

I grabbed them, thanks guys!

> 
> Cc stable@vger.kernel.org

For both or just the first?


* Re: [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition
  2018-06-21  8:08     ` Peter Zijlstra
@ 2018-06-21 10:51       ` Xunlei Pang
  0 siblings, 0 replies; 9+ messages in thread
From: Xunlei Pang @ 2018-06-21 10:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: bsegall, Ingo Molnar, linux-kernel, stable

On 6/21/18 4:08 PM, Peter Zijlstra wrote:
> On Thu, Jun 21, 2018 at 11:56:56AM +0800, Xunlei Pang wrote:
>>>> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
>>>> Cc: Ben Segall <bsegall@google.com>
>>>
>>> Reviewed-By: Ben Segall <bsegall@google.com>
>>
>> Thanks Ben :-)
>>
>> Hi Peter, could you please have a look at them?
> 
> I grabbed them, thanks guys!
> 
>>
>> Cc stable@vger.kernel.org
> 
> For both or just the first?
> 

Both, thanks!


* [tip:sched/core] sched/fair: Fix bandwidth timer clock drift condition
  2018-06-20 10:18 [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition Xunlei Pang
  2018-06-20 10:18 ` [PATCH v2 2/2] sched/fair: Advance global expiration when period timer is restarted Xunlei Pang
  2018-06-20 17:01 ` [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition bsegall
@ 2018-07-03  7:50 ` tip-bot for Xunlei Pang
  2 siblings, 0 replies; 9+ messages in thread
From: tip-bot for Xunlei Pang @ 2018-07-03  7:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, xlpang, bsegall, peterz, mingo, hpa, linux-kernel, tglx

Commit-ID:  512ac999d2755d2b7109e996a76b6fb8b888631d
Gitweb:     https://git.kernel.org/tip/512ac999d2755d2b7109e996a76b6fb8b888631d
Author:     Xunlei Pang <xlpang@linux.alibaba.com>
AuthorDate: Wed, 20 Jun 2018 18:18:33 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 3 Jul 2018 09:17:29 +0200

sched/fair: Fix bandwidth timer clock drift condition

I noticed that cgroup task groups constantly get throttled even
if they have low CPU usage, which causes jitter in the response
time of some of our business containers when CPU quotas are enabled.

It's very simple to reproduce:

  mkdir /sys/fs/cgroup/cpu/test
  cd /sys/fs/cgroup/cpu/test
  echo 100000 > cpu.cfs_quota_us
  echo $$ > tasks

then repeat:

  cat cpu.stat | grep nr_throttled  # nr_throttled will increase steadily

After some analysis, we found that cfs_rq::runtime_remaining will
be cleared by expire_cfs_rq_runtime() due to two equal but stale
"cfs_{b|q}->runtime_expires" after period timer is re-armed.

The current condition used to judge clock drift in expire_cfs_rq_runtime()
is wrong: the two runtime_expires values are actually the same when clock
drift happens, so this condition can never hit. The original design was
done correctly by this commit:

  a9cf55b28610 ("sched: Expire invalid runtime")

... but was changed to the current implementation because of a locking bug.

This patch introduces another way: it adds a new field to both struct
cfs_rq and struct cfs_bandwidth to record the expiration update sequence,
and uses them to figure out whether clock drift has happened (true if
they are equal).

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c  | 14 ++++++++------
 kernel/sched/sched.h |  6 ++++--
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1866e64792a7..791707c56886 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 	u64 amount = 0, min_amount, expires;
+	int expires_seq;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4628,6 +4630,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
+	expires_seq = cfs_b->expires_seq;
 	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	 * spread between our sched_clock and the one on which runtime was
 	 * issued.
 	 */
-	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
+	if (cfs_rq->expires_seq != expires_seq) {
+		cfs_rq->expires_seq = expires_seq;
 		cfs_rq->runtime_expires = expires;
+	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
@@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	 * has not truly expired.
 	 *
 	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline has advanced. It is valid to compare
-	 * cfs_b->runtime_expires without any locks since we only care about
-	 * exact equality, so a partial write will still work.
+	 * whether the global deadline(cfs_b->expires_seq) has advanced.
 	 */
-
-	if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
+	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
 		/* extend local deadline, drift is bounded above by 2 ticks */
 		cfs_rq->runtime_expires += TICK_NSEC;
 	} else {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 27ddec334601..c7742dcc136c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -334,9 +334,10 @@ struct cfs_bandwidth {
 	u64			runtime;
 	s64			hierarchical_quota;
 	u64			runtime_expires;
+	int			expires_seq;
 
-	int			idle;
-	int			period_active;
+	short			idle;
+	short			period_active;
 	struct hrtimer		period_timer;
 	struct hrtimer		slack_timer;
 	struct list_head	throttled_cfs_rq;
@@ -551,6 +552,7 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
+	int			expires_seq;
 	u64			runtime_expires;
 	s64			runtime_remaining;
 


* [tip:sched/core] sched/fair: Advance global expiration when period timer is restarted
  2018-06-20 10:18 ` [PATCH v2 2/2] sched/fair: Advance global expiration when period timer is restarted Xunlei Pang
  2018-06-20 17:02   ` bsegall
@ 2018-07-03  7:50   ` tip-bot for Xunlei Pang
  1 sibling, 0 replies; 9+ messages in thread
From: tip-bot for Xunlei Pang @ 2018-07-03  7:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, linux-kernel, torvalds, bsegall, xlpang, hpa, mingo, peterz

Commit-ID:  f1d1be8aee6c461652aea8f58bedebaa73d7f4d3
Gitweb:     https://git.kernel.org/tip/f1d1be8aee6c461652aea8f58bedebaa73d7f4d3
Author:     Xunlei Pang <xlpang@linux.alibaba.com>
AuthorDate: Wed, 20 Jun 2018 18:18:34 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 3 Jul 2018 09:17:29 +0200

sched/fair: Advance global expiration when period timer is restarted

When the period gets restarted after some idle time, start_cfs_bandwidth()
doesn't update the expiration information, so expire_cfs_rq_runtime() will
see cfs_rq->runtime_expires smaller than the rq clock and fall into the
clock drift logic, wasting needless CPU cycles on the scheduler hot path.

Update the global expiration in start_cfs_bandwidth() to avoid frequent
expire_cfs_rq_runtime() calls once a new period begins.

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180620101834.24455-2-xlpang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 791707c56886..840b92ee6f89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5204,13 +5204,18 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
+	u64 overrun;
+
 	lockdep_assert_held(&cfs_b->lock);
 
-	if (!cfs_b->period_active) {
-		cfs_b->period_active = 1;
-		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
-	}
+	if (cfs_b->period_active)
+		return;
+
+	cfs_b->period_active = 1;
+	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
+	cfs_b->expires_seq++;
+	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)


end of thread, other threads:[~2018-07-03  7:51 UTC | newest]

Thread overview: 9+ messages
2018-06-20 10:18 [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition Xunlei Pang
2018-06-20 10:18 ` [PATCH v2 2/2] sched/fair: Advance global expiration when period timer is restarted Xunlei Pang
2018-06-20 17:02   ` bsegall
2018-07-03  7:50   ` [tip:sched/core] " tip-bot for Xunlei Pang
2018-06-20 17:01 ` [PATCH v2 1/2] sched/fair: Fix bandwidth timer clock drift condition bsegall
2018-06-21  3:56   ` Xunlei Pang
2018-06-21  8:08     ` Peter Zijlstra
2018-06-21 10:51       ` Xunlei Pang
2018-07-03  7:50 ` [tip:sched/core] " tip-bot for Xunlei Pang
