* [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices
@ 2019-05-17 19:30 Dave Chiluk
  2019-05-23 18:44 ` [PATCH v2 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk
                   ` (5 more replies)
  0 siblings, 6 replies; 40+ messages in thread
From: Dave Chiluk @ 2019-05-17 19:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel,
	Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond,
	Cong Wang, Jonathan Corbet, linux-doc

It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota.  This use case is typical of user-interactive,
non-cpu-bound applications, such as those running in kubernetes or
mesos when run on multiple cpu cores.

This has been root caused to threads being allocated per-cpu bandwidth
slices, and then not fully using that slice within the period, at which
point the slice and quota expire.  This expiration of unused slice
results in applications not being able to utilize the quota they were
allocated.

The expiration of per-cpu slices was recently fixed by commit
512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition").
Prior to that it appears that this had been broken since at least commit
51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period") which was introduced in v3.16-rc1 in 2014.  That
commit added the following conditional, which resulted in slices never
being expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, has only recently been
fixed, and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577), it is my
opinion that the mechanisms around expiring runtime should be removed
altogether.

This change allows per-cpu slices to live longer than the period
boundary.  Threads on runqueues that do not use much CPU can therefore
continue to use their remaining slice over a longer period of time than
cpu.cfs_period_us.  This helps prevent the above condition of hitting
throttling while also not fully utilizing the allocated cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods.  This overflow would be bounded by the
remaining per-cpu slice that was left unused in the previous period.
For cpu-bound tasks this will change nothing, as they should
theoretically fully utilize all of their quota and slices in each
period.  For user-interactive tasks as described above, this provides a
much better user/application experience, as their cpu utilization will
more closely match the amount they requested when they hit throttling.
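
As a rough, illustrative bound (assuming the default 5ms slice from the
kernel.sched_cfs_bandwidth_slice_us sysctl, and a group that left slices
unused on 8 cpus in the previous period):

	max extra runtime in one period <= nr_cpus_with_unused_slice * slice
	                                 = 8 * 5ms = 40ms

So a group with cpu.cfs_quota_us=100000 (100ms) could at worst briefly
consume ~140ms of cpu in a single 100ms period, while its long-term
average usage remains capped at the configured quota.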

This greatly improves performance of high-thread-count, non-cpu-bound
applications with low cfs_quota_us allocations on high-core-count
machines.  With an artificial testcase, this has been observed to yield
an almost 30x performance improvement, while still maintaining correct
cpu quota restrictions, albeit over longer time intervals than
cpu.cfs_period_us.  That testcase is available at
https://github.com/indeedeng/fibtest.
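
For those unfamiliar with the interface, below is a minimal sketch (not
part of this patch) of how the relevant cgroup knobs and counters can be
poked while running such a testcase.  It assumes a cgroup v1 hierarchy
with the cpu controller mounted at /sys/fs/cgroup/cpu and an already
created "repro" group; the paths and the 100ms values are illustrative
only.

#include <stdio.h>
#include <stdlib.h>

static void write_val(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	char buf[256];
	FILE *f;

	/* 100ms period, 100ms quota: one cpu's worth of bandwidth */
	write_val("/sys/fs/cgroup/cpu/repro/cpu.cfs_period_us", "100000");
	write_val("/sys/fs/cgroup/cpu/repro/cpu.cfs_quota_us", "100000");

	/*
	 * Run the multi-threaded, non-cpu-bound workload inside the group
	 * (e.g. write its pid to .../repro/cgroup.procs and start fibtest),
	 * then dump nr_periods/nr_throttled/throttled_time to see how often
	 * the group is throttled despite using less than its quota.
	 */
	f = fopen("/sys/fs/cgroup/cpu/repro/cpu.stat", "r");
	if (!f) {
		perror("cpu.stat");
		return 1;
	}
	while (fgets(buf, sizeof(buf), f))
		fputs(buf, stdout);
	fclose(f);

	return 0;
}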

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
---
 Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
 kernel/sched/fair.c                   | 71 +++--------------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 29 insertions(+), 75 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4ded8ae 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.
 
 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time.  When the CPU bandwidth consumption of a group
+exceeds this limit (for that period), the tasks belonging to its hierarchy will
+be throttled and are not allowed to run again until the next period.
 
 A group's unused runtime is globally tracked, being refreshed with quota units
 above at each period boundary.  As threads consume this bandwidth it is
 transferred to cpu-local "silos" on a demand basis.  The amount transferred
-within each of these updates is tunable and described as the "slice".
+within each of these updates is tunable and described as the "slice".  Slices
+that are allocated to cpu-local silos do not expire at the end of the period,
+but unallocated quota does.  This doesn't affect cpu-bound applications as they
+by definition consume all of their bandwidth in each period.
+
+However, for highly-threaded, user-interactive, non-cpu-bound applications,
+this non-expiration nuance allows applications to burst past their quota
+limits by up to the amount of unused slice on each cpu that the task group
+is running on.  This slight burst requires that quota had gone unused in
+previous periods.  Additionally, the burst amount is limited to the size of
+a slice for every cpu a task group is run on.  As a result, this mechanism
+still strictly limits the task group to its quota as an average over longer
+time windows.  This provides a better, more predictable user experience for
+highly threaded applications with small quota limits on high core count
+machines.  It also eliminates the propensity to throttle these applications
+while they are simultaneously using less than their quota's worth of cpu.
+Another way to say this is that by allowing the unused portion of a slice to
+be used in following periods, we have decreased the possibility of wasting
+unused quota on cpu-local silos that don't need much cpu time.
 
 Management
 ----------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 	unsigned long flags;
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	if (expires == cfs_b->runtime_expires)
-		lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;
 
 	short			idle;
 	short			period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;
 
 	u64			throttled_clock;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v2 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-17 19:30 [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices Dave Chiluk
@ 2019-05-23 18:44 ` Dave Chiluk
  2019-05-23 18:44   ` [PATCH v2 1/1] " Dave Chiluk
  2019-05-29 19:08 ` [PATCH v3 0/1] " Dave Chiluk
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 40+ messages in thread
From: Dave Chiluk @ 2019-05-23 18:44 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel,
	Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond,
	Cong Wang, Jonathan Corbet, linux-doc

Changelog v2
    - Fixed some checkpatch errors in the commit message.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-23 18:44 ` [PATCH v2 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk
@ 2019-05-23 18:44   ` Dave Chiluk
  2019-05-23 21:01     ` Peter Oskolkov
  2019-05-24  8:55     ` Peter Zijlstra
  0 siblings, 2 replies; 40+ messages in thread
From: Dave Chiluk @ 2019-05-23 18:44 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel,
	Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond,
	Cong Wang, Jonathan Corbet, linux-doc

It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota.  This use case is typical of user-interactive,
non-cpu-bound applications, such as those running in kubernetes or
mesos when run on multiple cpu cores.

This has been root caused to threads being allocated per-cpu bandwidth
slices, and then not fully using that slice within the period, at which
point the slice and quota expire.  This expiration of unused slice
results in applications not being able to utilize the quota they were
allocated.

The expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'.  Prior to that it appears that this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014.  That
commit added the following conditional, which resulted in slices never
being expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, has only recently been
fixed, and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577), it is my
opinion that the mechanisms around expiring runtime should be removed
altogether.

This change allows per-cpu slices to live longer than the period
boundary.  Threads on runqueues that do not use much CPU can therefore
continue to use their remaining slice over a longer period of time than
cpu.cfs_period_us.  This helps prevent the above condition of hitting
throttling while also not fully utilizing the allocated cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods.  This overflow would be bounded by the
remaining per-cpu slice that was left unused in the previous period.
For cpu-bound tasks this will change nothing, as they should
theoretically fully utilize all of their quota and slices in each
period.  For user-interactive tasks as described above, this provides a
much better user/application experience, as their cpu utilization will
more closely match the amount they requested when they hit throttling.

This greatly improves performance of high-thread-count, non-cpu-bound
applications with low cfs_quota_us allocations on high-core-count
machines.  With an artificial testcase, this has been observed to yield
an almost 30x performance improvement, while still maintaining correct
cpu quota restrictions, albeit over longer time intervals than
cpu.cfs_period_us.  That testcase is available at
https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
---
 Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
 kernel/sched/fair.c                   | 71 +++--------------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 29 insertions(+), 75 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4ded8ae 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.
 
 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time.  When the CPU bandwidth consumption of a group
+exceeds this limit (for that period), the tasks belonging to its hierarchy will
+be throttled and are not allowed to run again until the next period.
 
 A group's unused runtime is globally tracked, being refreshed with quota units
 above at each period boundary.  As threads consume this bandwidth it is
 transferred to cpu-local "silos" on a demand basis.  The amount transferred
-within each of these updates is tunable and described as the "slice".
+within each of these updates is tunable and described as the "slice".  Slices
+that are allocated to cpu-local silos do not expire at the end of the period,
+but unallocated quota does.  This doesn't affect cpu-bound applications as they
+by definition consume all of their bandwidth in each period.
+
+However, for highly-threaded, user-interactive, non-cpu-bound applications,
+this non-expiration nuance allows applications to burst past their quota
+limits by up to the amount of unused slice on each cpu that the task group
+is running on.  This slight burst requires that quota had gone unused in
+previous periods.  Additionally, the burst amount is limited to the size of
+a slice for every cpu a task group is run on.  As a result, this mechanism
+still strictly limits the task group to its quota as an average over longer
+time windows.  This provides a better, more predictable user experience for
+highly threaded applications with small quota limits on high core count
+machines.  It also eliminates the propensity to throttle these applications
+while they are simultaneously using less than their quota's worth of cpu.
+Another way to say this is that by allowing the unused portion of a slice to
+be used in following periods, we have decreased the possibility of wasting
+unused quota on cpu-local silos that don't need much cpu time.
 
 Management
 ----------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 	unsigned long flags;
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	if (expires == cfs_b->runtime_expires)
-		lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;
 
 	short			idle;
 	short			period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;
 
 	u64			throttled_clock;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-23 18:44   ` [PATCH v2 1/1] " Dave Chiluk
@ 2019-05-23 21:01     ` Peter Oskolkov
  2019-05-24 14:32       ` Phil Auld
  2019-05-24  8:55     ` Peter Zijlstra
  1 sibling, 1 reply; 40+ messages in thread
From: Peter Oskolkov @ 2019-05-23 21:01 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Peter Zijlstra, Ingo Molnar, cgroups, Linux Kernel Mailing List,
	Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond,
	Cong Wang, Jonathan Corbet, linux-doc

On Thu, May 23, 2019 at 11:44 AM Dave Chiluk <chiluk+linux@indeed.com> wrote:
>
> It has been observed that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota.  This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. At which
> point the slice and quota expires.  This expiration of unused slice
> results in applications not being able to utilize the quota for which
> they are allocated.
>
> The expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'.  Prior to that it appears that this has been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014.  That
> added the following conditional which resulted in slices never being
> expired.
>
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
>         /* extend local deadline, drift is bounded above by 2 ticks */
>         cfs_rq->runtime_expires += TICK_NSEC;
>
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
>
> This allows only per-cpu slices to live longer than the period boundary.
> This allows threads on runqueues that do not use much CPU to continue to
> use their remaining slice over a longer period of time than
> cpu.cfs_period_us. However, this helps prevent the above condition of
> hitting throttling while also not fully utilizing your cpu quota.
>
> This theoretically allows a machine to use slightly more than its
> allotted quota in some periods.  This overflow would be bounded by the
> remaining per-cpu slice that was left un-used in the previous period.
> For CPU bound tasks this will change nothing, as they should
> theoretically fully utilize all of their quota and slices in each
> period. For user-interactive tasks as described above this provides a
> much better user/application experience as their cpu utilization will
> more closely match the amount they requested when they hit throttling.
>
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase, this performance
> discrepancy has been observed to be almost 30x performance improvement,
> while still maintaining correct cpu quota restrictions albeit over
> longer time intervals than cpu.cfs_period_us.

If the machine runs at/close to capacity, won't the overallocation
of the quota to bursty tasks necessarily negatively impact every other
task? Should the "unused" quota be available only on idle CPUs?
(Or maybe this is the behavior achieved here, and only the comment and
the commit message should be fixed...)

>  That testcase is
> available at https://github.com/indeedeng/fibtest.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> ---
>  Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
>  kernel/sched/fair.c                   | 71 +++--------------------------------
>  kernel/sched/sched.h                  |  4 --
>  3 files changed, 29 insertions(+), 75 deletions(-)
>
> diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
> index f6b1873..4ded8ae 100644
> --- a/Documentation/scheduler/sched-bwc.txt
> +++ b/Documentation/scheduler/sched-bwc.txt
> @@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
>  specification of the maximum CPU bandwidth available to a group or hierarchy.
>
>  The bandwidth allowed for a group is specified using a quota and period. Within
> -each given "period" (microseconds), a group is allowed to consume only up to
> -"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
> -group exceeds this limit (for that period), the tasks belonging to its
> -hierarchy will be throttled and are not allowed to run again until the next
> -period.
> +each given "period" (microseconds), a task group is allocated up to "quota"
> +microseconds of CPU time.  When the CPU bandwidth consumption of a group
> +exceeds this limit (for that period), the tasks belonging to its hierarchy will
> +be throttled and are not allowed to run again until the next period.
>
>  A group's unused runtime is globally tracked, being refreshed with quota units
>  above at each period boundary.  As threads consume this bandwidth it is
>  transferred to cpu-local "silos" on a demand basis.  The amount transferred
> -within each of these updates is tunable and described as the "slice".
> +within each of these updates is tunable and described as the "slice".  Slices
> +that are allocated to cpu-local silos do not expire at the end of the period,
> +but unallocated quota does.  This doesn't affect cpu-bound applications as they
> +by definition consume all of their bandwidth in each period.
> +
> +However for highly-threaded user-interactive/non-cpu bound applications this
> +non-expiration nuance allows applications to burst past their quota limits
> +equal to the amount of unused slice per cpu that the task group is running on.
> +This slight burst requires that quota had gone unused in previous periods.
> +Additionally this burst amount is limited to the size of a slice for every cpu
> +a task group is run on.  As a result, this mechanism still strictly limits the
> +task group to quota average usage over a longer time windows.  This provides
> +better more predictable user experience for highly threaded applications with
> +small quota limits on high core count machines.  It also eliminates the
> +propensity to throttle these applications while simultaneously using less than
> +quota amounts of cpu.  Another way to say this, is that by allowing the unused
> +portion of a slice to be used in following periods we have decreased the
> +possibility of wasting unused quota on cpu-local silos that don't need much cpu
> +time.
>
>  Management
>  ----------
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..a675c69 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>
>         now = sched_clock_cpu(smp_processor_id());
>         cfs_b->runtime = cfs_b->quota;
> -       cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> -       cfs_b->expires_seq++;
>  }
>
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  {
>         struct task_group *tg = cfs_rq->tg;
>         struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> -       u64 amount = 0, min_amount, expires;
> -       int expires_seq;
> +       u64 amount = 0, min_amount;
>
>         /* note: this is a positive sum as runtime_remaining <= 0 */
>         min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>                         cfs_b->idle = 0;
>                 }
>         }
> -       expires_seq = cfs_b->expires_seq;
> -       expires = cfs_b->runtime_expires;
>         raw_spin_unlock(&cfs_b->lock);
>
>         cfs_rq->runtime_remaining += amount;
> -       /*
> -        * we may have advanced our local expiration to account for allowed
> -        * spread between our sched_clock and the one on which runtime was
> -        * issued.
> -        */
> -       if (cfs_rq->expires_seq != expires_seq) {
> -               cfs_rq->expires_seq = expires_seq;
> -               cfs_rq->runtime_expires = expires;
> -       }
>
>         return cfs_rq->runtime_remaining > 0;
>  }
>
> -/*
> - * Note: This depends on the synchronization provided by sched_clock and the
> - * fact that rq->clock snapshots this value.
> - */
> -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> -{
> -       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -
> -       /* if the deadline is ahead of our clock, nothing to do */
> -       if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> -               return;
> -
> -       if (cfs_rq->runtime_remaining < 0)
> -               return;
> -
> -       /*
> -        * If the local deadline has passed we have to consider the
> -        * possibility that our sched_clock is 'fast' and the global deadline
> -        * has not truly expired.
> -        *
> -        * Fortunately we can check determine whether this the case by checking
> -        * whether the global deadline(cfs_b->expires_seq) has advanced.
> -        */
> -       if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> -               /* extend local deadline, drift is bounded above by 2 ticks */
> -               cfs_rq->runtime_expires += TICK_NSEC;
> -       } else {
> -               /* global deadline is ahead, expiration has passed */
> -               cfs_rq->runtime_remaining = 0;
> -       }
> -}
> -
>  static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>  {
>         /* dock delta_exec before expiring quota (as it could span periods) */
>         cfs_rq->runtime_remaining -= delta_exec;
> -       expire_cfs_rq_runtime(cfs_rq);
>
>         if (likely(cfs_rq->runtime_remaining > 0))
>                 return;
> @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>                 resched_curr(rq);
>  }
>
> -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> -               u64 remaining, u64 expires)
> +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>  {
>         struct cfs_rq *cfs_rq;
>         u64 runtime;
> @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>                 remaining -= runtime;
>
>                 cfs_rq->runtime_remaining += runtime;
> -               cfs_rq->runtime_expires = expires;
>
>                 /* we check whether we're throttled above */
>                 if (cfs_rq->runtime_remaining > 0)
> @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>   */
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
>  {
> -       u64 runtime, runtime_expires;
> +       u64 runtime;
>         int throttled;
>
>         /* no need to continue the timer with no bandwidth constraint */
> @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>         /* account preceding periods in which throttling occurred */
>         cfs_b->nr_throttled += overrun;
>
> -       runtime_expires = cfs_b->runtime_expires;
> -
>         /*
>          * This check is repeated as we are holding onto the new bandwidth while
>          * we unthrottle. This can potentially race with an unthrottled group
> @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>                 cfs_b->distribute_running = 1;
>                 raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>                 /* we can't nest cfs_b->lock while distributing bandwidth */
> -               runtime = distribute_cfs_runtime(cfs_b, runtime,
> -                                                runtime_expires);
> +               runtime = distribute_cfs_runtime(cfs_b, runtime);
>                 raw_spin_lock_irqsave(&cfs_b->lock, flags);
>
>                 cfs_b->distribute_running = 0;
> @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>                 return;
>
>         raw_spin_lock(&cfs_b->lock);
> -       if (cfs_b->quota != RUNTIME_INF &&
> -           cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> +       if (cfs_b->quota != RUNTIME_INF) {
>                 cfs_b->runtime += slack_runtime;
>
>                 /* we are under rq->lock, defer unthrottling using a timer */
> @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  {
>         u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
>         unsigned long flags;
> -       u64 expires;
>
>         /* confirm we're still not at a refresh boundary */
>         raw_spin_lock_irqsave(&cfs_b->lock, flags);
> @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>         if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
>                 runtime = cfs_b->runtime;
>
> -       expires = cfs_b->runtime_expires;
>         if (runtime)
>                 cfs_b->distribute_running = 1;
>
> @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>         if (!runtime)
>                 return;
>
> -       runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> +       runtime = distribute_cfs_runtime(cfs_b, runtime);
>
>         raw_spin_lock_irqsave(&cfs_b->lock, flags);
> -       if (expires == cfs_b->runtime_expires)
> -               lsub_positive(&cfs_b->runtime, runtime);
>         cfs_b->distribute_running = 0;
>         raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  }
> @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>
>         cfs_b->period_active = 1;
>         overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> -       cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> -       cfs_b->expires_seq++;
>         hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>  }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..0c0ed23 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -341,8 +341,6 @@ struct cfs_bandwidth {
>         u64                     quota;
>         u64                     runtime;
>         s64                     hierarchical_quota;
> -       u64                     runtime_expires;
> -       int                     expires_seq;
>
>         short                   idle;
>         short                   period_active;
> @@ -562,8 +560,6 @@ struct cfs_rq {
>
>  #ifdef CONFIG_CFS_BANDWIDTH
>         int                     runtime_enabled;
> -       int                     expires_seq;
> -       u64                     runtime_expires;
>         s64                     runtime_remaining;
>
>         u64                     throttled_clock;
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-23 18:44   ` [PATCH v2 1/1] " Dave Chiluk
  2019-05-23 21:01     ` Peter Oskolkov
@ 2019-05-24  8:55     ` Peter Zijlstra
  1 sibling, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2019-05-24  8:55 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc, Ben Segall, Paul Turner


(it always helps to Cc the people who actually wrote the code)

Ben, can you have a look at this?

On Thu, May 23, 2019 at 01:44:47PM -0500, Dave Chiluk wrote:
> It has been observed that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota.  This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
> 
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. At which
> point the slice and quota expires.  This expiration of unused slice
> results in applications not being able to utilize the quota for which
> they are allocated.
> 
> The expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'.  Prior to that it appears that this has been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014.  That
> added the following conditional which resulted in slices never being
> expired.
> 
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> 	/* extend local deadline, drift is bounded above by 2 ticks */
> 	cfs_rq->runtime_expires += TICK_NSEC;
> 
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
> 
> This allows only per-cpu slices to live longer than the period boundary.
> This allows threads on runqueues that do not use much CPU to continue to
> use their remaining slice over a longer period of time than
> cpu.cfs_period_us. However, this helps prevent the above condition of
> hitting throttling while also not fully utilizing your cpu quota.
> 
> This theoretically allows a machine to use slightly more than its
> allotted quota in some periods.  This overflow would be bounded by the
> remaining per-cpu slice that was left un-used in the previous period.
> For CPU bound tasks this will change nothing, as they should
> theoretically fully utilize all of their quota and slices in each
> period. For user-interactive tasks as described above this provides a
> much better user/application experience as their cpu utilization will
> more closely match the amount they requested when they hit throttling.
> 
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase, this performance
> discrepancy has been observed to be almost 30x performance improvement,
> while still maintaining correct cpu quota restrictions albeit over
> longer time intervals than cpu.cfs_period_us.  That testcase is
> available at https://github.com/indeedeng/fibtest.
> 
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> ---
>  Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
>  kernel/sched/fair.c                   | 71 +++--------------------------------
>  kernel/sched/sched.h                  |  4 --
>  3 files changed, 29 insertions(+), 75 deletions(-)
> 
> diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
> index f6b1873..4ded8ae 100644
> --- a/Documentation/scheduler/sched-bwc.txt
> +++ b/Documentation/scheduler/sched-bwc.txt
> @@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
>  specification of the maximum CPU bandwidth available to a group or hierarchy.
>  
>  The bandwidth allowed for a group is specified using a quota and period. Within
> -each given "period" (microseconds), a group is allowed to consume only up to
> -"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
> -group exceeds this limit (for that period), the tasks belonging to its
> -hierarchy will be throttled and are not allowed to run again until the next
> -period.
> +each given "period" (microseconds), a task group is allocated up to "quota"
> +microseconds of CPU time.  When the CPU bandwidth consumption of a group
> +exceeds this limit (for that period), the tasks belonging to its hierarchy will
> +be throttled and are not allowed to run again until the next period.
>  
>  A group's unused runtime is globally tracked, being refreshed with quota units
>  above at each period boundary.  As threads consume this bandwidth it is
>  transferred to cpu-local "silos" on a demand basis.  The amount transferred
> -within each of these updates is tunable and described as the "slice".
> +within each of these updates is tunable and described as the "slice".  Slices
> +that are allocated to cpu-local silos do not expire at the end of the period,
> +but unallocated quota does.  This doesn't affect cpu-bound applications as they
> +by definition consume all of their bandwidth in each period.
> +
> +However for highly-threaded user-interactive/non-cpu bound applications this
> +non-expiration nuance allows applications to burst past their quota limits
> +equal to the amount of unused slice per cpu that the task group is running on.
> +This slight burst requires that quota had gone unused in previous periods.
> +Additionally this burst amount is limited to the size of a slice for every cpu
> +a task group is run on.  As a result, this mechanism still strictly limits the
> +task group to quota average usage over a longer time windows.  This provides
> +better more predictable user experience for highly threaded applications with
> +small quota limits on high core count machines.  It also eliminates the
> +propensity to throttle these applications while simultaneously using less than
> +quota amounts of cpu.  Another way to say this, is that by allowing the unused
> +portion of a slice to be used in following periods we have decreased the
> +possibility of wasting unused quota on cpu-local silos that don't need much cpu
> +time.
>  
>  Management
>  ----------
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..a675c69 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>  
>  	now = sched_clock_cpu(smp_processor_id());
>  	cfs_b->runtime = cfs_b->quota;
> -	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> -	cfs_b->expires_seq++;
>  }
>  
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  {
>  	struct task_group *tg = cfs_rq->tg;
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> -	u64 amount = 0, min_amount, expires;
> -	int expires_seq;
> +	u64 amount = 0, min_amount;
>  
>  	/* note: this is a positive sum as runtime_remaining <= 0 */
>  	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  			cfs_b->idle = 0;
>  		}
>  	}
> -	expires_seq = cfs_b->expires_seq;
> -	expires = cfs_b->runtime_expires;
>  	raw_spin_unlock(&cfs_b->lock);
>  
>  	cfs_rq->runtime_remaining += amount;
> -	/*
> -	 * we may have advanced our local expiration to account for allowed
> -	 * spread between our sched_clock and the one on which runtime was
> -	 * issued.
> -	 */
> -	if (cfs_rq->expires_seq != expires_seq) {
> -		cfs_rq->expires_seq = expires_seq;
> -		cfs_rq->runtime_expires = expires;
> -	}
>  
>  	return cfs_rq->runtime_remaining > 0;
>  }
>  
> -/*
> - * Note: This depends on the synchronization provided by sched_clock and the
> - * fact that rq->clock snapshots this value.
> - */
> -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> -{
> -	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -
> -	/* if the deadline is ahead of our clock, nothing to do */
> -	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> -		return;
> -
> -	if (cfs_rq->runtime_remaining < 0)
> -		return;
> -
> -	/*
> -	 * If the local deadline has passed we have to consider the
> -	 * possibility that our sched_clock is 'fast' and the global deadline
> -	 * has not truly expired.
> -	 *
> -	 * Fortunately we can check determine whether this the case by checking
> -	 * whether the global deadline(cfs_b->expires_seq) has advanced.
> -	 */
> -	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> -		/* extend local deadline, drift is bounded above by 2 ticks */
> -		cfs_rq->runtime_expires += TICK_NSEC;
> -	} else {
> -		/* global deadline is ahead, expiration has passed */
> -		cfs_rq->runtime_remaining = 0;
> -	}
> -}
> -
>  static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>  {
>  	/* dock delta_exec before expiring quota (as it could span periods) */
>  	cfs_rq->runtime_remaining -= delta_exec;
> -	expire_cfs_rq_runtime(cfs_rq);
>  
>  	if (likely(cfs_rq->runtime_remaining > 0))
>  		return;
> @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  		resched_curr(rq);
>  }
>  
> -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> -		u64 remaining, u64 expires)
> +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>  {
>  	struct cfs_rq *cfs_rq;
>  	u64 runtime;
> @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>  		remaining -= runtime;
>  
>  		cfs_rq->runtime_remaining += runtime;
> -		cfs_rq->runtime_expires = expires;
>  
>  		/* we check whether we're throttled above */
>  		if (cfs_rq->runtime_remaining > 0)
> @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>   */
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
>  {
> -	u64 runtime, runtime_expires;
> +	u64 runtime;
>  	int throttled;
>  
>  	/* no need to continue the timer with no bandwidth constraint */
> @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>  	/* account preceding periods in which throttling occurred */
>  	cfs_b->nr_throttled += overrun;
>  
> -	runtime_expires = cfs_b->runtime_expires;
> -
>  	/*
>  	 * This check is repeated as we are holding onto the new bandwidth while
>  	 * we unthrottle. This can potentially race with an unthrottled group
> @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>  		cfs_b->distribute_running = 1;
>  		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  		/* we can't nest cfs_b->lock while distributing bandwidth */
> -		runtime = distribute_cfs_runtime(cfs_b, runtime,
> -						 runtime_expires);
> +		runtime = distribute_cfs_runtime(cfs_b, runtime);
>  		raw_spin_lock_irqsave(&cfs_b->lock, flags);
>  
>  		cfs_b->distribute_running = 0;
> @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  		return;
>  
>  	raw_spin_lock(&cfs_b->lock);
> -	if (cfs_b->quota != RUNTIME_INF &&
> -	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> +	if (cfs_b->quota != RUNTIME_INF) {
>  		cfs_b->runtime += slack_runtime;
>  
>  		/* we are under rq->lock, defer unthrottling using a timer */
> @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  {
>  	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
>  	unsigned long flags;
> -	u64 expires;
>  
>  	/* confirm we're still not at a refresh boundary */
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
>  		runtime = cfs_b->runtime;
>  
> -	expires = cfs_b->runtime_expires;
>  	if (runtime)
>  		cfs_b->distribute_running = 1;
>  
> @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (!runtime)
>  		return;
>  
> -	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> +	runtime = distribute_cfs_runtime(cfs_b, runtime);
>  
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> -	if (expires == cfs_b->runtime_expires)
> -		lsub_positive(&cfs_b->runtime, runtime);
>  	cfs_b->distribute_running = 0;
>  	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  }
> @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  
>  	cfs_b->period_active = 1;
>  	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> -	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> -	cfs_b->expires_seq++;
>  	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>  }
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..0c0ed23 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -341,8 +341,6 @@ struct cfs_bandwidth {
>  	u64			quota;
>  	u64			runtime;
>  	s64			hierarchical_quota;
> -	u64			runtime_expires;
> -	int			expires_seq;
>  
>  	short			idle;
>  	short			period_active;
> @@ -562,8 +560,6 @@ struct cfs_rq {
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
>  	int			runtime_enabled;
> -	int			expires_seq;
> -	u64			runtime_expires;
>  	s64			runtime_remaining;
>  
>  	u64			throttled_clock;
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-23 21:01     ` Peter Oskolkov
@ 2019-05-24 14:32       ` Phil Auld
  2019-05-24 15:14         ` Dave Chiluk
  0 siblings, 1 reply; 40+ messages in thread
From: Phil Auld @ 2019-05-24 14:32 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Dave Chiluk, Peter Zijlstra, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc

On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> On Thu, May 23, 2019 at 11:44 AM Dave Chiluk <chiluk+linux@indeed.com> wrote:
> >
> > It has been observed, that highly-threaded, non-cpu-bound applications
> > running under cpu.cfs_quota_us constraints can hit a high percentage of
> > periods throttled while simultaneously not consuming the allocated
> > amount of quota.  This use case is typical of user-interactive non-cpu
> > bound applications, such as those running in kubernetes or mesos when
> > run on multiple cpu cores.
> >
> > This has been root caused to threads being allocated per cpu bandwidth
> > slices, and then not fully using that slice within the period. At which
> > point the slice and quota expires.  This expiration of unused slice
> > results in applications not being able to utilize the quota for which
> > they are allocated.
> >
> > The expiration of per-cpu slices was recently fixed by
> > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> > condition")'.  Prior to that it appears that this has been broken since
> > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014.  That
> > added the following conditional which resulted in slices never being
> > expired.
> >
> > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> >         /* extend local deadline, drift is bounded above by 2 ticks */
> >         cfs_rq->runtime_expires += TICK_NSEC;
> >
> > Because this was broken for nearly 5 years, and has recently been fixed
> > and is now being noticed by many users running kubernetes
> > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> > that the mechanisms around expiring runtime should be removed
> > altogether.
> >
> > This allows only per-cpu slices to live longer than the period boundary.
> > This allows threads on runqueues that do not use much CPU to continue to
> > use their remaining slice over a longer period of time than
> > cpu.cfs_period_us. However, this helps prevents the above condition of
> > hitting throttling while also not fully utilizing your cpu quota.
> >
> > This theoretically allows a machine to use slightly more than it's
> > allotted quota in some periods.  This overflow would be bounded by the
> > remaining per-cpu slice that was left un-used in the previous period.
> > For CPU bound tasks this will change nothing, as they should
> > theoretically fully utilize all of their quota and slices in each
> > period. For user-interactive tasks as described above this provides a
> > much better user/application experience as their cpu utilization will
> > more closely match the amount they requested when they hit throttling.
> >
> > This greatly improves performance of high-thread-count, non-cpu bound
> > applications with low cfs_quota_us allocation on high-core-count
> > machines. In the case of an artificial testcase, this performance
> > discrepancy has been observed to be almost 30x performance improvement,
> > while still maintaining correct cpu quota restrictions albeit over
> > longer time intervals than cpu.cfs_period_us.
> 
> If the machine runs at/close to capacity, won't the overallocation
> of the quota to bursty tasks necessarily negatively impact every other
> task? Should the "unused" quota be available only on idle CPUs?
> (Or maybe this is the behavior achieved here, and only the comment and
> the commit message should be fixed...)
>

It's bounded by the amount left unused from the previous period. So
theoretically a process could use almost twice its quota. But then it
would have nothing left over in the next period. To repeat the burst it
would have to not use any quota in that next period. Over a longer number
of periods it's the same amount of CPU usage.

I think that is more fair than throttling a process that has never used
its full quota.

And it removes complexity. 

Cheers,
Phil

> >  That testcase is
> > available at https://github.com/indeedeng/fibtest.
> >
> > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> > Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> > ---
> >  Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
> >  kernel/sched/fair.c                   | 71 +++--------------------------------
> >  kernel/sched/sched.h                  |  4 --
> >  3 files changed, 29 insertions(+), 75 deletions(-)
> >
> > diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
> > index f6b1873..4ded8ae 100644
> > --- a/Documentation/scheduler/sched-bwc.txt
> > +++ b/Documentation/scheduler/sched-bwc.txt
> > @@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
> >  specification of the maximum CPU bandwidth available to a group or hierarchy.
> >
> >  The bandwidth allowed for a group is specified using a quota and period. Within
> > -each given "period" (microseconds), a group is allowed to consume only up to
> > -"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
> > -group exceeds this limit (for that period), the tasks belonging to its
> > -hierarchy will be throttled and are not allowed to run again until the next
> > -period.
> > +each given "period" (microseconds), a task group is allocated up to "quota"
> > +microseconds of CPU time.  When the CPU bandwidth consumption of a group
> > +exceeds this limit (for that period), the tasks belonging to its hierarchy will
> > +be throttled and are not allowed to run again until the next period.
> >
> >  A group's unused runtime is globally tracked, being refreshed with quota units
> >  above at each period boundary.  As threads consume this bandwidth it is
> >  transferred to cpu-local "silos" on a demand basis.  The amount transferred
> > -within each of these updates is tunable and described as the "slice".
> > +within each of these updates is tunable and described as the "slice".  Slices
> > +that are allocated to cpu-local silos do not expire at the end of the period,
> > +but unallocated quota does.  This doesn't affect cpu-bound applications as they
> > +by definition consume all of their bandwidth in each each period.
> > +
> > +However for highly-threaded user-interactive/non-cpu bound applications this
> > +non-expiration nuance allows applications to burst past their quota limits
> > +equal to the amount of unused slice per cpu that the task group is running on.
> > +This slight burst requires that quota had gone unused in previous periods.
> > +Additionally this burst amount is limited to the size of a slice for every cpu
> > +a task group is run on.  As a result, this mechanism still strictly limits the
> > +task group to quota average usage over a longer time windows.  This provides
> > +better more predictable user experience for highly threaded applications with
> > +small quota limits on high core count machines.  It also eliminates the
> > +propensity to throttle these applications while simultanously using less than
> > +quota amounts of cpu.  Another way to say this, is that by allowing the unused
> > +portion of a slice to be used in following periods we have decreased the
> > +possibility of wasting unused quota on cpu-local silos that don't need much cpu
> > +time.
> >
> >  Management
> >  ----------
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f35930f..a675c69 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> >
> >         now = sched_clock_cpu(smp_processor_id());
> >         cfs_b->runtime = cfs_b->quota;
> > -       cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> > -       cfs_b->expires_seq++;
> >  }
> >
> >  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> > @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >  {
> >         struct task_group *tg = cfs_rq->tg;
> >         struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> > -       u64 amount = 0, min_amount, expires;
> > -       int expires_seq;
> > +       u64 amount = 0, min_amount;
> >
> >         /* note: this is a positive sum as runtime_remaining <= 0 */
> >         min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> > @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >                         cfs_b->idle = 0;
> >                 }
> >         }
> > -       expires_seq = cfs_b->expires_seq;
> > -       expires = cfs_b->runtime_expires;
> >         raw_spin_unlock(&cfs_b->lock);
> >
> >         cfs_rq->runtime_remaining += amount;
> > -       /*
> > -        * we may have advanced our local expiration to account for allowed
> > -        * spread between our sched_clock and the one on which runtime was
> > -        * issued.
> > -        */
> > -       if (cfs_rq->expires_seq != expires_seq) {
> > -               cfs_rq->expires_seq = expires_seq;
> > -               cfs_rq->runtime_expires = expires;
> > -       }
> >
> >         return cfs_rq->runtime_remaining > 0;
> >  }
> >
> > -/*
> > - * Note: This depends on the synchronization provided by sched_clock and the
> > - * fact that rq->clock snapshots this value.
> > - */
> > -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > -{
> > -       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> > -
> > -       /* if the deadline is ahead of our clock, nothing to do */
> > -       if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> > -               return;
> > -
> > -       if (cfs_rq->runtime_remaining < 0)
> > -               return;
> > -
> > -       /*
> > -        * If the local deadline has passed we have to consider the
> > -        * possibility that our sched_clock is 'fast' and the global deadline
> > -        * has not truly expired.
> > -        *
> > -        * Fortunately we can check determine whether this the case by checking
> > -        * whether the global deadline(cfs_b->expires_seq) has advanced.
> > -        */
> > -       if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> > -               /* extend local deadline, drift is bounded above by 2 ticks */
> > -               cfs_rq->runtime_expires += TICK_NSEC;
> > -       } else {
> > -               /* global deadline is ahead, expiration has passed */
> > -               cfs_rq->runtime_remaining = 0;
> > -       }
> > -}
> > -
> >  static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> >  {
> >         /* dock delta_exec before expiring quota (as it could span periods) */
> >         cfs_rq->runtime_remaining -= delta_exec;
> > -       expire_cfs_rq_runtime(cfs_rq);
> >
> >         if (likely(cfs_rq->runtime_remaining > 0))
> >                 return;
> > @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> >                 resched_curr(rq);
> >  }
> >
> > -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> > -               u64 remaining, u64 expires)
> > +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
> >  {
> >         struct cfs_rq *cfs_rq;
> >         u64 runtime;
> > @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> >                 remaining -= runtime;
> >
> >                 cfs_rq->runtime_remaining += runtime;
> > -               cfs_rq->runtime_expires = expires;
> >
> >                 /* we check whether we're throttled above */
> >                 if (cfs_rq->runtime_remaining > 0)
> > @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> >   */
> >  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
> >  {
> > -       u64 runtime, runtime_expires;
> > +       u64 runtime;
> >         int throttled;
> >
> >         /* no need to continue the timer with no bandwidth constraint */
> > @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> >         /* account preceding periods in which throttling occurred */
> >         cfs_b->nr_throttled += overrun;
> >
> > -       runtime_expires = cfs_b->runtime_expires;
> > -
> >         /*
> >          * This check is repeated as we are holding onto the new bandwidth while
> >          * we unthrottle. This can potentially race with an unthrottled group
> > @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> >                 cfs_b->distribute_running = 1;
> >                 raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> >                 /* we can't nest cfs_b->lock while distributing bandwidth */
> > -               runtime = distribute_cfs_runtime(cfs_b, runtime,
> > -                                                runtime_expires);
> > +               runtime = distribute_cfs_runtime(cfs_b, runtime);
> >                 raw_spin_lock_irqsave(&cfs_b->lock, flags);
> >
> >                 cfs_b->distribute_running = 0;
> > @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >                 return;
> >
> >         raw_spin_lock(&cfs_b->lock);
> > -       if (cfs_b->quota != RUNTIME_INF &&
> > -           cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> > +       if (cfs_b->quota != RUNTIME_INF) {
> >                 cfs_b->runtime += slack_runtime;
> >
> >                 /* we are under rq->lock, defer unthrottling using a timer */
> > @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >  {
> >         u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> >         unsigned long flags;
> > -       u64 expires;
> >
> >         /* confirm we're still not at a refresh boundary */
> >         raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >         if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
> >                 runtime = cfs_b->runtime;
> >
> > -       expires = cfs_b->runtime_expires;
> >         if (runtime)
> >                 cfs_b->distribute_running = 1;
> >
> > @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >         if (!runtime)
> >                 return;
> >
> > -       runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> > +       runtime = distribute_cfs_runtime(cfs_b, runtime);
> >
> >         raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > -       if (expires == cfs_b->runtime_expires)
> > -               lsub_positive(&cfs_b->runtime, runtime);
> >         cfs_b->distribute_running = 0;
> >         raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> >  }
> > @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >
> >         cfs_b->period_active = 1;
> >         overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> > -       cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> > -       cfs_b->expires_seq++;
> >         hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> >  }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b52ed1a..0c0ed23 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -341,8 +341,6 @@ struct cfs_bandwidth {
> >         u64                     quota;
> >         u64                     runtime;
> >         s64                     hierarchical_quota;
> > -       u64                     runtime_expires;
> > -       int                     expires_seq;
> >
> >         short                   idle;
> >         short                   period_active;
> > @@ -562,8 +560,6 @@ struct cfs_rq {
> >
> >  #ifdef CONFIG_CFS_BANDWIDTH
> >         int                     runtime_enabled;
> > -       int                     expires_seq;
> > -       u64                     runtime_expires;
> >         s64                     runtime_remaining;
> >
> >         u64                     throttled_clock;
> > --
> > 1.8.3.1
> >

-- 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-24 14:32       ` Phil Auld
@ 2019-05-24 15:14         ` Dave Chiluk
  2019-05-24 15:59           ` Phil Auld
  2019-05-24 16:28           ` Peter Oskolkov
  0 siblings, 2 replies; 40+ messages in thread
From: Dave Chiluk @ 2019-05-24 15:14 UTC (permalink / raw)
  To: Phil Auld
  Cc: Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc, Ben Segall

On Fri, May 24, 2019 at 9:32 AM Phil Auld <pauld@redhat.com> wrote:
> On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:

> > If the machine runs at/close to capacity, won't the overallocation
> > of the quota to bursty tasks necessarily negatively impact every other
> > task? Should the "unused" quota be available only on idle CPUs?
> > (Or maybe this is the behavior achieved here, and only the comment and
> > the commit message should be fixed...)
> >
>
> It's bounded by the amount left unused from the previous period. So
> theoretically a process could use almost twice its quota. But then it
> would have nothing left over in the next period. To repeat it would have
> to not use any that next period. Over a longer number of periods it's the
> same amount of CPU usage.
>
> I think that is more fair than throttling a process that has never used
> its full quota.
>
> And it removes complexity.
>
> Cheers,
> Phil

Actually it's not even that bad.  The overallocation of quota to a
bursty task in a period is limited to at most one slice per cpu, and
that slice must not have been used in the previous periods.  The slice
size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
defaults to 5ms.  If a bursty task goes from underutilizing quota to
using its entire quota, it will not be able to burst in the
subsequent periods.  Therefore in an absolute worst case contrived
scenario, a bursty task can add at most 5ms to the latency of other
threads on the same CPU.  I think this worst case 5ms tradeoff is
entirely worth it.
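
For reference, that slice can be inspected or tuned through the sysctl
mentioned above; a quick sketch (values are only illustrative, run as root):

    # current slice size in microseconds (5000, i.e. 5ms, by default)
    cat /proc/sys/kernel/sched_cfs_bandwidth_slice_us
    # shrinking the slice also shrinks the worst-case per-cpu carry-over
    echo 1000 > /proc/sys/kernel/sched_cfs_bandwidth_slice_us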

This does mean that, theoretically, a poorly written, massively
threaded application on an 80-core box that spreads itself onto 80
cpu run queues can overutilize its quota in a single period by at most
5ms * 80 CPUs (slice * number of runqueues the application is running
on).  But that means that each of those threads would have had to
leave its slice unused in a previous period, and it also means that
the application would have to be carefully written to exacerbate this
behavior.

Additionally, if cpu-bound threads underutilize a slice of their quota
in a period because CFS chose a bursty task to run, they should
theoretically be able to make it up in the following periods when the
bursty task is unable to "burst".

Please be careful here: quota and slices are being treated differently.
Quota does not roll over between periods; only slices of quota that have
already been allocated to per-cpu run queues do.  If you allocate
100ms of quota per period to an application, but it only spreads onto
3 cpu run queues, then in the worst case it can use an extra 3 x slice
size = 15ms in periods following underutilization.
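
As a concrete sketch of that example (the cgroup name "myapp" is
hypothetical, and this assumes the v1 cpu controller is mounted at
/sys/fs/cgroup/cpu):

    # 100ms of quota per 100ms period for an example group
    mkdir -p /sys/fs/cgroup/cpu/myapp
    echo 100000 > /sys/fs/cgroup/cpu/myapp/cpu.cfs_period_us
    echo 100000 > /sys/fs/cgroup/cpu/myapp/cpu.cfs_quota_us
    # with the default 5ms slice and threads on only 3 runqueues, the
    # worst-case carry-over is 3 * 5ms = 15ms on top of the 100ms quota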

So why does this matter?  Applications that use thread pools
(*cough* java *cough*) with lots of tiny worker threads tend to
spread themselves out onto a lot of run queues.  These worker
threads grab quota slices in order to run, then rarely use all of
their slice (1 or 2ms out of the 5ms).  This results in those worker
threads starving the main application of quota, and then expiring the
remainder of that quota slice on the per-cpu run queue.  Going back to
my earlier 100ms quota / 80 cpu example: only
100ms / cfs_bandwidth_slice_us (5ms) = 20 slices are available in a
period, so only 20 out of those 80 cpus ever get a slice allocated to
them.  By allowing these per-cpu run queues to use their remaining
slice in following periods, the worker threads do not need to be
allocated additional slices, and the main threads are actually able
to use the allocated cpu quota.
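
Spelling that arithmetic out (illustrative values matching the example
above):

    quota_us=100000   # cpu.cfs_quota_us for the group
    slice_us=5000     # default sched_cfs_bandwidth_slice_us
    echo $(( quota_us / slice_us ))   # => 20 slices available per period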

This can be experienced by running fibtest, available at
https://github.com/indeedeng/fibtest/.
$ runfibtest 1
runs a single fast thread taskset to cpu 0
$ runfibtest 8
runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
to cpus 1-7.  The second run is expected to show fewer iterations, but
the bigger problem is that the cpu usage is far less than the 500ms that
it should have received.
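
While such a run is in flight, the mismatch between throttling and
actual usage can be seen in the cgroup counters; roughly (the cgroup
path is only an example, and assumes cpu and cpuacct are co-mounted):

    cg=/sys/fs/cgroup/cpu,cpuacct/myapp
    cat $cg/cpu.stat        # nr_periods, nr_throttled, throttled_time
    cat $cg/cpuacct.usage   # total cpu time consumed, in nanoseconds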

Thanks for the engagement on this,
Dave Chiluk

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-24 15:14         ` Dave Chiluk
@ 2019-05-24 15:59           ` Phil Auld
  2019-05-24 16:28           ` Peter Oskolkov
  1 sibling, 0 replies; 40+ messages in thread
From: Phil Auld @ 2019-05-24 15:59 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc, Ben Segall

On Fri, May 24, 2019 at 10:14:36AM -0500 Dave Chiluk wrote:
> On Fri, May 24, 2019 at 9:32 AM Phil Auld <pauld@redhat.com> wrote:
> > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> 
> > > If the machine runs at/close to capacity, won't the overallocation
> > > of the quota to bursty tasks necessarily negatively impact every other
> > > task? Should the "unused" quota be available only on idle CPUs?
> > > (Or maybe this is the behavior achieved here, and only the comment and
> > > the commit message should be fixed...)
> > >
> >
> > It's bounded by the amount left unused from the previous period. So
> > theoretically a process could use almost twice its quota. But then it
> > would have nothing left over in the next period. To repeat it would have
> > to not use any that next period. Over a longer number of periods it's the
> > same amount of CPU usage.
> >
> > I think that is more fair than throttling a process that has never used
> > its full quota.
> >
> > And it removes complexity.
> >
> > Cheers,
> > Phil
> 
> Actually it's not even that bad.  The overallocation of quota to a
> bursty task in a period is limited to at most one slice per cpu, and
> that slice must not have been used in the previous periods.  The slice
> size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> defaults to 5ms.  If a bursty task goes from underutilizing quota to
> using it's entire quota, it will not be able to burst in the
> subsequent periods.  Therefore in an absolute worst case contrived
> scenario, a bursty task can add at most 5ms to the latency of other
> threads on the same CPU.  I think this worst case 5ms tradeoff is
> entirely worth it.
> 
> This does mean that a theoretically a poorly written massively
> threaded application on an 80 core box, that spreads itself onto 80
> cpu run queues, can overutilize it's quota in a period by at most 5ms
> * 80 CPUs in a sincle period (slice * number of runqueues the
> application is running on).  But that means that each of those threads
>  would have had to not be use their quota in a previous period, and it
> also means that the application would have to be carefully written to
> exacerbate this behavior.
> 
> Additionally if cpu bound threads underutilize a slice of their quota
> in a period due to the cfs choosing a bursty task to run, they should
> theoretically be able to make it up in the following periods when the
> bursty task is unable to "burst".
> 
> Please be careful here quota and slice are being treated differently.
> Quota does not roll-over between periods, only slices of quota that
> has already been allocated to per cpu run queues. If you allocate
> 100ms of quota per period to an application, but it only spreads onto
> 3 cpu run queues that means it can in the worst case use 3 x slice
> size = 15ms in periods following underutilization.
> 
> So why does this matter.  Well applications that use thread pools
> *(*cough* java *cough*) with lots of tiny little worker threads, tend
> to spread themselves out onto a lot of run queues.  These worker
> threads grab quota slices in order to run, then rarely use all of
> their slice (1 or 2ms out of the 5ms).  This results in those worker
> threads starving the main application of quota, and then expiring the
> remainder of that quota slice on the per-cpu.  Going back to my
> earlier 100ms quota / 80 cpu example.  That means only
> 100ms/cfs_bandwidth_slice_us(5ms) = 20 slices are available in a
> period.  So only 20 out of these 80 cpus ever get a slice allocated to
> them.  By allowing these per-cpu run queues to use their remaining
> slice in following periods these worker threads do not need to be
> allocated additional slice, and thereby the main threads are actually
> able to use the allocated cpu quota.
> 
> This can be experienced by running fibtest available at
> https://github.com/indeedeng/fibtest/.
> $ runfibtest 1
> runs a single fast thread taskset to cpu 0
> $ runfibtest 8
> Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
> to cpus 1-7.  This run is expected to show less iterations, but the
> worse problem is that the cpu usage is far less than the 500ms that it
> should have received.
> 
> Thanks for the engagement on this,
> Dave Chiluk

Thanks for the clarification. This is an even better explanation. 

Fwiw, I ran some of my cfs throttling tests with this, none of which are
designed to measure or hit this particular issue. They are more focused
on starvation and hard lockups that I've hit. But I did not see any issues
or oddities with this patch. 

Cheers,
Phil

-- 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-24 15:14         ` Dave Chiluk
  2019-05-24 15:59           ` Phil Auld
@ 2019-05-24 16:28           ` Peter Oskolkov
  2019-05-24 21:35             ` Dave Chiluk
  1 sibling, 1 reply; 40+ messages in thread
From: Peter Oskolkov @ 2019-05-24 16:28 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc, Ben Segall

On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <chiluk+linux@indeed.com> wrote:
>
> On Fri, May 24, 2019 at 9:32 AM Phil Auld <pauld@redhat.com> wrote:
> > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
>
> > > If the machine runs at/close to capacity, won't the overallocation
> > > of the quota to bursty tasks necessarily negatively impact every other
> > > task? Should the "unused" quota be available only on idle CPUs?
> > > (Or maybe this is the behavior achieved here, and only the comment and
> > > the commit message should be fixed...)
> > >
> >
> > It's bounded by the amount left unused from the previous period. So
> > theoretically a process could use almost twice its quota. But then it
> > would have nothing left over in the next period. To repeat it would have
> > to not use any that next period. Over a longer number of periods it's the
> > same amount of CPU usage.
> >
> > I think that is more fair than throttling a process that has never used
> > its full quota.
> >
> > And it removes complexity.
> >
> > Cheers,
> > Phil
>
> Actually it's not even that bad.  The overallocation of quota to a
> bursty task in a period is limited to at most one slice per cpu, and
> that slice must not have been used in the previous periods.  The slice
> size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> defaults to 5ms.  If a bursty task goes from underutilizing quota to
> using it's entire quota, it will not be able to burst in the
> subsequent periods.  Therefore in an absolute worst case contrived
> scenario, a bursty task can add at most 5ms to the latency of other
> threads on the same CPU.  I think this worst case 5ms tradeoff is
> entirely worth it.
>
> This does mean that a theoretically a poorly written massively
> threaded application on an 80 core box, that spreads itself onto 80
> cpu run queues, can overutilize it's quota in a period by at most 5ms
> * 80 CPUs in a sincle period (slice * number of runqueues the
> application is running on).  But that means that each of those threads
>  would have had to not be use their quota in a previous period, and it
> also means that the application would have to be carefully written to
> exacerbate this behavior.
>
> Additionally if cpu bound threads underutilize a slice of their quota
> in a period due to the cfs choosing a bursty task to run, they should
> theoretically be able to make it up in the following periods when the
> bursty task is unable to "burst".

OK, so it is indeed possible that CPU bound threads will underutilize a slice
of their quota in a period as a result of this patch. This should probably
be clearly stated in the code comments and in the commit message.

In addition, I believe that although many workloads will indeed be
indifferent to getting their fair share "later", some latency-sensitive
workloads will definitely be negatively affected by this temporary
CPU quota stealing by bursty antagonists. So there should probably be
a way to limit this behavior; for example, by making it tunable
per cgroup.

>
> Please be careful here quota and slice are being treated differently.
> Quota does not roll-over between periods, only slices of quota that
> has already been allocated to per cpu run queues. If you allocate
> 100ms of quota per period to an application, but it only spreads onto
> 3 cpu run queues that means it can in the worst case use 3 x slice
> size = 15ms in periods following underutilization.
>
> So why does this matter.  Well applications that use thread pools
> *(*cough* java *cough*) with lots of tiny little worker threads, tend
> to spread themselves out onto a lot of run queues.  These worker
> threads grab quota slices in order to run, then rarely use all of
> their slice (1 or 2ms out of the 5ms).  This results in those worker
> threads starving the main application of quota, and then expiring the
> remainder of that quota slice on the per-cpu.  Going back to my
> earlier 100ms quota / 80 cpu example.  That means only
> 100ms/cfs_bandwidth_slice_us(5ms) = 20 slices are available in a
> period.  So only 20 out of these 80 cpus ever get a slice allocated to
> them.  By allowing these per-cpu run queues to use their remaining
> slice in following periods these worker threads do not need to be
> allocated additional slice, and thereby the main threads are actually
> able to use the allocated cpu quota.
>
> This can be experienced by running fibtest available at
> https://github.com/indeedeng/fibtest/.
> $ runfibtest 1
> runs a single fast thread taskset to cpu 0
> $ runfibtest 8
> Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
> to cpus 1-7.  This run is expected to show less iterations, but the
> worse problem is that the cpu usage is far less than the 500ms that it
> should have received.
>
> Thanks for the engagement on this,
> Dave Chiluk

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-24 16:28           ` Peter Oskolkov
@ 2019-05-24 21:35             ` Dave Chiluk
  2019-05-24 22:07               ` Peter Oskolkov
  0 siblings, 1 reply; 40+ messages in thread
From: Dave Chiluk @ 2019-05-24 21:35 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc, Ben Segall

On Fri, May 24, 2019 at 11:28 AM Peter Oskolkov <posk@posk.io> wrote:
>
> On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <chiluk+linux@indeed.com> wrote:
> >
> > On Fri, May 24, 2019 at 9:32 AM Phil Auld <pauld@redhat.com> wrote:
> > > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> >
> > > > If the machine runs at/close to capacity, won't the overallocation
> > > > of the quota to bursty tasks necessarily negatively impact every other
> > > > task? Should the "unused" quota be available only on idle CPUs?
> > > > (Or maybe this is the behavior achieved here, and only the comment and
> > > > the commit message should be fixed...)
> > > >
> > >
> > > It's bounded by the amount left unused from the previous period. So
> > > theoretically a process could use almost twice its quota. But then it
> > > would have nothing left over in the next period. To repeat it would have
> > > to not use any that next period. Over a longer number of periods it's the
> > > same amount of CPU usage.
> > >
> > > I think that is more fair than throttling a process that has never used
> > > its full quota.
> > >
> > > And it removes complexity.
> > >
> > > Cheers,
> > > Phil
> >
> > Actually it's not even that bad.  The overallocation of quota to a
> > bursty task in a period is limited to at most one slice per cpu, and
> > that slice must not have been used in the previous periods.  The slice
> > size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> > defaults to 5ms.  If a bursty task goes from underutilizing quota to
> > using it's entire quota, it will not be able to burst in the
> > subsequent periods.  Therefore in an absolute worst case contrived
> > scenario, a bursty task can add at most 5ms to the latency of other
> > threads on the same CPU.  I think this worst case 5ms tradeoff is
> > entirely worth it.
> >
> > This does mean that a theoretically a poorly written massively
> > threaded application on an 80 core box, that spreads itself onto 80
> > cpu run queues, can overutilize it's quota in a period by at most 5ms
> > * 80 CPUs in a sincle period (slice * number of runqueues the
> > application is running on).  But that means that each of those threads
> >  would have had to not be use their quota in a previous period, and it
> > also means that the application would have to be carefully written to
> > exacerbate this behavior.
> >
> > Additionally if cpu bound threads underutilize a slice of their quota
> > in a period due to the cfs choosing a bursty task to run, they should
> > theoretically be able to make it up in the following periods when the
> > bursty task is unable to "burst".
>
> OK, so it is indeed possible that CPU bound threads will underutilize a slice
> of their quota in a period as a result of this patch. This should probably
> be clearly stated in the code comments and in the commit message.
>
> In addition, I believe that although many workloads will indeed be
> indifferent to getting their fair share "later", some latency-sensitive
> workloads will definitely be negatively affected by this temporary
> CPU quota stealing by bursty antagonists. So there should probably be
> a way to limit this behavior; for example, by making it tunable
> per cgroup.
>
This patch restores the behavior that existed from at least
v3.16..v4.18, and in the current Red Hat 7 kernels.  So you are kind of
championing a moot point, as no one has noticed this "bursting"
behavior in over 5 years.  By removing this slice expiration
altogether we restore the behavior and also solve the root issue of
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'.

Since 512ac999d275, many people are now noticing the slice expiration
and are very displeased with the behavior change.
see: https://github.com/kubernetes/kubernetes/issues/67577

I would love to hear if you know of a single complaint during that
5-year time window where someone noticed this bursting and reported
that it negatively affected their application.

As for the documentation, I thought about documenting the possible
adverse side effect, but I didn't feel it was worthwhile since no one
had noticed that corner case in those 5 years.  I also could not figure
out a concise way of describing the slight corner-case issue without
overly complicating the documentation.  I felt that adding that corner
case would have detracted from, rather than added to, the documentation's
usefulness.  As for commenting in the code, considering most of this
commit removes lines, there is not really a great place for it.  That's
why I did my best to describe the behavior in the documentation.
Smart people can draw conclusions from there, as you have done.  Please
keep in mind that this "bursting" is again limited to a single slice
on each per-cpu run queue.

Thank you,
Dave

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-24 21:35             ` Dave Chiluk
@ 2019-05-24 22:07               ` Peter Oskolkov
  2019-05-28 22:25                 ` Dave Chiluk
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Oskolkov @ 2019-05-24 22:07 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc, Ben Segall

On Fri, May 24, 2019 at 2:35 PM Dave Chiluk <chiluk+linux@indeed.com> wrote:
>
> On Fri, May 24, 2019 at 11:28 AM Peter Oskolkov <posk@posk.io> wrote:
> >
> > On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <chiluk+linux@indeed.com> wrote:
> > >
> > > On Fri, May 24, 2019 at 9:32 AM Phil Auld <pauld@redhat.com> wrote:
> > > > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> > >
> > > > > If the machine runs at/close to capacity, won't the overallocation
> > > > > of the quota to bursty tasks necessarily negatively impact every other
> > > > > task? Should the "unused" quota be available only on idle CPUs?
> > > > > (Or maybe this is the behavior achieved here, and only the comment and
> > > > > the commit message should be fixed...)
> > > > >
> > > >
> > > > It's bounded by the amount left unused from the previous period. So
> > > > theoretically a process could use almost twice its quota. But then it
> > > > would have nothing left over in the next period. To repeat it would have
> > > > to not use any that next period. Over a longer number of periods it's the
> > > > same amount of CPU usage.
> > > >
> > > > I think that is more fair than throttling a process that has never used
> > > > its full quota.
> > > >
> > > > And it removes complexity.
> > > >
> > > > Cheers,
> > > > Phil
> > >
> > > Actually it's not even that bad.  The overallocation of quota to a
> > > bursty task in a period is limited to at most one slice per cpu, and
> > > that slice must not have been used in the previous periods.  The slice
> > > size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> > > defaults to 5ms.  If a bursty task goes from underutilizing quota to
> > > using it's entire quota, it will not be able to burst in the
> > > subsequent periods.  Therefore in an absolute worst case contrived
> > > scenario, a bursty task can add at most 5ms to the latency of other
> > > threads on the same CPU.  I think this worst case 5ms tradeoff is
> > > entirely worth it.
> > >
> > > This does mean that a theoretically a poorly written massively
> > > threaded application on an 80 core box, that spreads itself onto 80
> > > cpu run queues, can overutilize it's quota in a period by at most 5ms
> > > * 80 CPUs in a sincle period (slice * number of runqueues the
> > > application is running on).  But that means that each of those threads
> > >  would have had to not be use their quota in a previous period, and it
> > > also means that the application would have to be carefully written to
> > > exacerbate this behavior.
> > >
> > > Additionally if cpu bound threads underutilize a slice of their quota
> > > in a period due to the cfs choosing a bursty task to run, they should
> > > theoretically be able to make it up in the following periods when the
> > > bursty task is unable to "burst".
> >
> > OK, so it is indeed possible that CPU bound threads will underutilize a slice
> > of their quota in a period as a result of this patch. This should probably
> > be clearly stated in the code comments and in the commit message.
> >
> > In addition, I believe that although many workloads will indeed be
> > indifferent to getting their fair share "later", some latency-sensitive
> > workloads will definitely be negatively affected by this temporary
> > CPU quota stealing by bursty antagonists. So there should probably be
> > a way to limit this behavior; for example, by making it tunable
> > per cgroup.
> >
> This patch restores the behavior that existed from at least
> v3.16..v4.18, and the current Redhat 7 kernels.  So you are kind of
> championing a moot point as no one has noticed this "bursting"
> behavior in over 5 years.  By removing this slice expiration
> altogether we restore the behavior and also solve the root issue of
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'.
>
> Since 512ac999d275, many people are now noticing the slice expiration
> and very displeased with the behavior change.
> see: https://github.com/kubernetes/kubernetes/issues/67577
>
> I would love to hear if you know of a single complaint during that 5
> year time window, where someone noticed this bursting and reported
> that it negatively affected their application.

Linux CPU scheduling tail latency is a well-known issue and a major
pain point in some workloads:
https://www.google.com/search?q=linux+cpu+scheduling+tail+latency

Even assuming that nobody noticed this particular cause
of CPU scheduling latencies, it does not mean the problem should be waved
away. At least it should be documented, if at this point it is decided
that it is difficult to address in a meaningful way. And, preferably, a way
to address the issue later on should be discussed and hopefully agreed to.

>
> As for the documentation, I thought about documenting the possible
> adverse side-effect, but I didn't feel it was worthwhile since no one
> had noticed that corner case in the 5 years.  I also could not figure
> out a concise way of describing the slight corner case issue without
> overly complicating the documentation.  I felt that adding that corner
> case would have detracted rather than added to the documentation's
> usefulness.  As far as commenting in code, considering most of this
> commit removes lines, there's not a really great place for it.  That's
> why I did my best to describe the behavior in the documentation.
> Smart people can draw conclusions from there as you have done.  Please
> keep in mind that this "bursting" is again limited to a single slice
> on each per-cpu run-queue.
>
> Thank you,
> Dave

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-24 22:07               ` Peter Oskolkov
@ 2019-05-28 22:25                 ` Dave Chiluk
  0 siblings, 0 replies; 40+ messages in thread
From: Dave Chiluk @ 2019-05-28 22:25 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc, Ben Segall

On Fri, May 24, 2019 at 5:07 PM Peter Oskolkov <posk@posk.io> wrote:
> Linux CPU scheduling tail latency is a well-known issue and a major
> pain point in some workloads:
> https://www.google.com/search?q=linux+cpu+scheduling+tail+latency
>
> Even assuming that nobody noticed this particular cause
> of CPU scheduling latencies, it does not mean the problem should be waved
> away. At least it should be documented, if at this point it decided that
> it is difficult to address it in a meaningful way. And, preferably, a way
> to address the issue later on should be discussed and hopefully agreed to.

Reducing tail latencies for our web applications is the precise
reason I created this patch set.  Applications that previously
responded in 20ms at the 95th percentile were now taking 220ms.  These
were correctly sized applications prior to 512ac999; after that commit
they started seeing massive increases in their latencies due to
hitting throttling while using less than their quota of cpu.

I'll see if I can rework the documentation.  Any specific
suggestions for how that can be worded would be appreciated.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-17 19:30 [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices Dave Chiluk
  2019-05-23 18:44 ` [PATCH v2 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk
@ 2019-05-29 19:08 ` Dave Chiluk
  2019-05-29 19:08   ` [PATCH v3 1/1] " Dave Chiluk
       [not found] ` <1561391404-14450-1-git-send-email-chiluk+linux@indeed.com>
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 40+ messages in thread
From: Dave Chiluk @ 2019-05-29 19:08 UTC (permalink / raw)
  To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc

Changelog v3
	- Reworked documentation to better describe behavior of slice expiration
	  per feedback from Peter Oskolkov

Changelog v2
	- Fixed some checkpatch errors in the commit message.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-29 19:08 ` [PATCH v3 0/1] " Dave Chiluk
@ 2019-05-29 19:08   ` Dave Chiluk
  2019-05-29 19:28     ` Phil Auld
                       ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Dave Chiluk @ 2019-05-29 19:08 UTC (permalink / raw)
  To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota.  This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to threads being allocated per cpu bandwidth
slices, and then not fully using that slice within the period. At which
point the slice and quota expires.  This expiration of unused slice
results in applications not being able to utilize the quota for which
they are allocated.

The expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'.  Prior to that it appears that this has been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014.  That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows only per-cpu slices to live longer than the period boundary.
This allows threads on runqueues that do not use much CPU to continue to
use their remaining slice over a longer period of time than
cpu.cfs_period_us. However, this helps prevents the above condition of
hitting throttling while also not fully utilizing your cpu quota.

This theoretically allows a machine to use slightly more than it's
allotted quota in some periods.  This overflow would be bounded by the
remaining per-cpu slice that was left un-used in the previous period.
For CPU bound tasks this will change nothing, as they should
theoretically fully utilize all of their quota and slices in each
period. For user-interactive tasks as described above this provides a
much better user/application experience as their cpu utilization will
more closely match the amount they requested when they hit throttling.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase, this performance
discrepancy has been observed to be almost 30x performance improvement,
while still maintaining correct cpu quota restrictions albeit over
longer time intervals than cpu.cfs_period_us.  That testcase is
available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
---
 Documentation/scheduler/sched-bwc.txt | 56 ++++++++++++++++++++++-----
 kernel/sched/fair.c                   | 71 +++--------------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 53 insertions(+), 78 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..260fd65 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.
 
 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary.  As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis.  The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time.  That quota is assigned to per cpu run queues in
+slices as threads in the cgroup become runnable.  Once all quota has been
+assigned, any additional requests for quota will result in those threads being
+throttled.  Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary.  As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis.  The amount transferred
 within each of these updates is tunable and described as the "slice".
 
 Management
@@ -90,6 +91,43 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
 
+Real-world behavior of slice non-expiration
+-------------------------------------------
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period.  As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+However, in a worst-case scenario for highly-threaded, interactive/non-cpu bound
+applications, this non-expiration nuance allows applications to briefly burst
+past their quota limits by the amount of unused slice on each cpu that the task
+group is running on.  This slight burst requires that quota had been assigned
+and then not fully used in previous periods.  This burst amount will not be
+transferred between cores.  As a result, this mechanism still strictly limits
+the task group to quota average usage, albeit over a longer time window than
+period.  This provides a better, more predictable user experience for highly
+threaded applications with small quota limits on high core count machines.  It
+also eliminates the propensity to throttle these applications while
+simultaneously using less than quota amounts of cpu.  Another way to say this
+is that by allowing the unused portion of a slice to remain valid across
+periods we have decreased the possibility of wasting quota on cpu-local silos
+that don't need a full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%.  If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to sched_cfs_bandwidth_slice_us additional quota in some periods,
+thereby preventing the cpu-bound application from fully using its quota by
+that same amount.  In these instances it will be up to the CFS algorithm (see
+sched-design-CFS.txt) to decide which application is chosen to run, as they
+will both be runnable and have remaining quota.  This runtime discrepancy
+should be made up in the following periods when the interactive application idles.
+
 Examples
 --------
 1. Limit a group to 1 CPU worth of runtime.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 	unsigned long flags;
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	if (expires == cfs_b->runtime_expires)
-		lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;
 
 	short			idle;
 	short			period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;
 
 	u64			throttled_clock;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-29 19:08   ` [PATCH v3 1/1] " Dave Chiluk
@ 2019-05-29 19:28     ` Phil Auld
  2019-05-29 19:50     ` bsegall
  2019-05-29 21:05     ` bsegall
  2 siblings, 0 replies; 40+ messages in thread
From: Phil Auld @ 2019-05-29 19:28 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Ben Segall, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc

On Wed, May 29, 2019 at 02:08:46PM -0500 Dave Chiluk wrote:
> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota.  This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
> 
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. At which
> point the slice and quota expires.  This expiration of unused slice
> results in applications not being able to utilize the quota for which
> they are allocated.
> 
> The expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'.  Prior to that it appears that this has been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014.  That
> added the following conditional which resulted in slices never being
> expired.
> 
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> 	/* extend local deadline, drift is bounded above by 2 ticks */
> 	cfs_rq->runtime_expires += TICK_NSEC;
> 
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
> 
> This allows only per-cpu slices to live longer than the period boundary.
> This allows threads on runqueues that do not use much CPU to continue to
> use their remaining slice over a longer period of time than
> cpu.cfs_period_us. However, this helps prevents the above condition of
> hitting throttling while also not fully utilizing your cpu quota.
> 
> This theoretically allows a machine to use slightly more than it's
> allotted quota in some periods.  This overflow would be bounded by the
> remaining per-cpu slice that was left un-used in the previous period.
> For CPU bound tasks this will change nothing, as they should
> theoretically fully utilize all of their quota and slices in each
> period. For user-interactive tasks as described above this provides a
> much better user/application experience as their cpu utilization will
> more closely match the amount they requested when they hit throttling.
> 
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase, this performance
> discrepancy has been observed to be almost 30x performance improvement,
> while still maintaining correct cpu quota restrictions albeit over
> longer time intervals than cpu.cfs_period_us.  That testcase is
> available at https://github.com/indeedeng/fibtest.
> 
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> ---
>  Documentation/scheduler/sched-bwc.txt | 56 ++++++++++++++++++++++-----
>  kernel/sched/fair.c                   | 71 +++--------------------------------
>  kernel/sched/sched.h                  |  4 --
>  3 files changed, 53 insertions(+), 78 deletions(-)
> 
> diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
> index f6b1873..260fd65 100644
> --- a/Documentation/scheduler/sched-bwc.txt
> +++ b/Documentation/scheduler/sched-bwc.txt
> @@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
>  specification of the maximum CPU bandwidth available to a group or hierarchy.
>  
>  The bandwidth allowed for a group is specified using a quota and period. Within
> -each given "period" (microseconds), a group is allowed to consume only up to
> -"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
> -group exceeds this limit (for that period), the tasks belonging to its
> -hierarchy will be throttled and are not allowed to run again until the next
> -period.
> -
> -A group's unused runtime is globally tracked, being refreshed with quota units
> -above at each period boundary.  As threads consume this bandwidth it is
> -transferred to cpu-local "silos" on a demand basis.  The amount transferred
> +each given "period" (microseconds), a task group is allocated up to "quota"
> +microseconds of CPU time.  That quota is assigned to per cpu run queues in
> +slices as threads in the cgroup become runnable.  Once all quota has been
> +assigned any additional requests for quota will result in those threads being
> +throttled.  Throttled threads will not be able to run again until the next
> +period when the quota is replenished.
> +
> +A group's unassigned quota is globally tracked, being refreshed back to
> +cfs_quota units at each period boundary.  As threads consume this bandwidth it
> +is transferred to cpu-local "silos" on a demand basis.  The amount transferred
>  within each of these updates is tunable and described as the "slice".
>  
>  Management
> @@ -90,6 +91,43 @@ There are two ways in which a group may become throttled:
>  In case b) above, even though the child may have runtime remaining it will not
>  be allowed to until the parent's runtime is refreshed.
>  
> +Real-world behavior of slice non-expiration
> +-------------------------------------------
> +The fact that cpu-local slices do not expire results in some interesting corner
> +cases that should be understood.
> +
> +For cgroup cpu constrained applications that are cpu limited this is a
> +relatively moot point because they will naturally consume the entirety of their
> +quota as well as the entirety of each cpu-local slice in each period.  As a
> +result it is expected that nr_periods roughly equal nr_throttled, and that
> +cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
> +
> +However in a worst-case scenario, highly-threaded, interactive/non-cpu bound
> +applications this non-expiration nuance allows applications to briefly burst
> +past their quota limits by the amount of unused slice on each cpu that the task
> +group is running on.  This slight burst requires that quota had been assigned
> +and then not fully used in previous periods.  This burst amount will not be
> +transferred between cores.  As a result, this mechanism still strictly limits
> +the task group to quota average usage, albeit over a longer time window than
> +period.  This provides better more predictable user experience for highly
> +threaded applications with small quota limits on high core count machines.  It
> +also eliminates the propensity to throttle these applications while
> +simultanously using less than quota amounts of cpu.  Another way to say this,
> +is that by allowing the unused portion of a slice to remain valid across
> +periods we have decreased the possibility of wasting quota on cpu-local silos
> +that don't need a full slice's amount of cpu time.
> +
> +The interaction between cpu-bound and non-cpu-bound-interactive applications
> +should also be considered, especially when single core usage hits 100%.  If you
> +gave each of these applications half of a cpu-core and they both got scheduled
> +on the same CPU it is theoretically possible that the non-cpu bound application
> +will use up to sched_cfs_bandwidth_slice_us additional quota in some periods,
> +thereby preventing the cpu-bound application from fully using it's quota by


"its quota"


> +that same amount.  In these instances it will be up to the CFS algorithm (see
> +sched-design-CFS.txt) to decide which application is chosen to run, as they
> +will both be runnable and have remaining quota.  This runtime discrepancy will
> +should made up in the following periods when the interactive application idles.
> +


"discrepancy will be made"  or "descrepancy should be made"  but not both :)



Otherwise, fwiw, 

Acked-by:  Phil Auld <pauld@redhat.com>



Cheers,
Phil


-- 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-29 19:08   ` [PATCH v3 1/1] " Dave Chiluk
  2019-05-29 19:28     ` Phil Auld
@ 2019-05-29 19:50     ` bsegall
  2019-05-29 21:05     ` bsegall
  2 siblings, 0 replies; 40+ messages in thread
From: bsegall @ 2019-05-29 19:50 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc

Dave Chiluk <chiluk+linux@indeed.com> writes:

> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota.  This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. At which
> point the slice and quota expires.  This expiration of unused slice
> results in applications not being able to utilize the quota for which
> they are allocated.
>

So most of the bandwidth is supposed to be returned, leaving only 1ms.
I'm setting up to try your test now, but there was a bug with the
unthrottle part of this where it could be continuously pushed into the
future. You might try this patch instead, possibly along with lowering
min_cfs_rq_runtime (currently not runtime tunable) if you see that cost
more than the cost of extra locking to acquire runtime.





From: Ben Segall <bsegall@google.com>
Date: Tue, 28 May 2019 12:30:25 -0700
Subject: [PATCH] sched/fair: don't push cfs_bandwith slack timers forward

When a cfs_rq sleeps and returns its quota, we delay for 5ms before waking any
throttled cfs_rqs to coalesce with other cfs_rqs going to sleep, as this has
to be done outside of the rq lock we hold. The current code waits for 5ms
without any sleeps, instead of waiting for 5ms from the first sleep, which can
delay the unthrottle more than we want. Switch this around.

Change-Id: Ife536bb54af633863de486ba6ec0d20e55ada8c9
Signed-off-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c  | 7 +++++++
 kernel/sched/sched.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8213ff6e365d..2ead252cfa32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4729,6 +4729,11 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
 	if (runtime_refresh_within(cfs_b, min_left))
 		return;
 
+	/* don't push forwards an existing deferred unthrottle */
+	if (cfs_b->slack_started)
+		return;
+	cfs_b->slack_started = true;
+
 	hrtimer_start(&cfs_b->slack_timer,
 			ns_to_ktime(cfs_bandwidth_slack_period),
 			HRTIMER_MODE_REL);
@@ -4782,6 +4787,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
+	cfs_b->slack_started = false;
 	if (cfs_b->distribute_running) {
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		return;
@@ -4920,6 +4926,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 	cfs_b->distribute_running = 0;
+	cfs_b->slack_started = false;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efa686eeff26..60219acda94b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -356,6 +356,7 @@ struct cfs_bandwidth {
 	u64			throttled_time;
 
 	bool                    distribute_running;
+	bool                    slack_started;
 #endif
 };
 
-- 
2.22.0.rc1.257.g3120a18244-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-29 19:08   ` [PATCH v3 1/1] " Dave Chiluk
  2019-05-29 19:28     ` Phil Auld
  2019-05-29 19:50     ` bsegall
@ 2019-05-29 21:05     ` bsegall
  2019-05-30 17:53       ` Dave Chiluk
  2 siblings, 1 reply; 40+ messages in thread
From: bsegall @ 2019-05-29 21:05 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc, pjt

Dave Chiluk <chiluk+linux@indeed.com> writes:

> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota.  This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. At which
> point the slice and quota expires.  This expiration of unused slice
> results in applications not being able to utilize the quota for which
> they are allocated.
>
> The expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'.  Prior to that it appears that this has been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014.  That
> added the following conditional which resulted in slices never being
> expired.


Yeah, having run the test, stranding only 1 ms per cpu rather than 5
doesn't help if you only have 10 ms of quota and even 10 threads/cpus.
The slack timer isn't important in this test, though I think it probably
should be changed.

Decreasing min_cfs_rq_runtime helps, but would mean that we have to pull
quota more often / always. The worst case here I think is where you
run/sleep for ~1ns, so you wind up taking the lock twice every
min_cfs_rq_runtime: once for assign and once to return all but min,
which you then use up doing short run/sleep. I suppose that determines
how much we care about this overhead at all.
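
Spelling that worst case out with the current defaults (1ms
min_cfs_rq_runtime; the ~1ns run/sleep pattern above is only an
illustrative assumption):

  2 cfs_b->lock acquisitions per ~1ms of runtime consumed on a cpu
    (one in assign_cfs_rq_runtime(), one in __return_cfs_rq_runtime())
  ~= 2000 global lock acquisitions per second of cpu time, per cpu,
  versus a lock on nearly every enqueue and dequeue if min were 0.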

Removing expiration means that in the worst case period and quota can be
effectively twice what the user specified, but only on very particular
workloads.

I think we should at least think about instead lowering
min_cfs_rq_runtime to some smaller value.
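
For reference, the knob being discussed is a compile-time constant in
kernel/sched/fair.c (as of the kernels in this thread), not a runtime
tunable:

/* a cfs_rq won't donate quota below this amount */
static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;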

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-29 21:05     ` bsegall
@ 2019-05-30 17:53       ` Dave Chiluk
  2019-05-30 20:44         ` bsegall
  0 siblings, 1 reply; 40+ messages in thread
From: Dave Chiluk @ 2019-05-30 17:53 UTC (permalink / raw)
  To: bsegall
  Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc, pjt

On Wed, May 29, 2019 at 02:05:55PM -0700, bsegall@google.com wrote:
> Dave Chiluk <chiluk+linux@indeed.com> writes:
>
> Yeah, having run the test, stranding only 1 ms per cpu rather than 5
> doesn't help if you only have 10 ms of quota and even 10 threads/cpus.
> The slack timer isn't important in this test, though I think it probably
> should be changed.
My min_cfs_rq_runtime was already set to 1ms.

Additionally raising the amount of quota from 10ms to 50ms or even
100ms, still results in throttling without full quota usage.

> Decreasing min_cfs_rq_runtime helps, but would mean that we have to pull
> quota more often / always. The worst case here I think is where you
> run/sleep for ~1ns, so you wind up taking the lock twice every
> min_cfs_rq_runtime: once for assign and once to return all but min,
> which you then use up doing short run/sleep. I suppose that determines
> how much we care about this overhead at all.
I'm not so concerned about how inefficiently the user-space application
runs, as that's up to the individual developer.  The fibtest testcase is
purely my approximation of what a java application with lots of worker
threads might do, as I didn't have a great deterministic java
reproducer, and I feared posting java to LKML.  I'm more concerned with
the fact that the user requested 10ms/period or 100ms/period and they
hit throttling while simultaneously not seeing that amount of cpu usage.
i.e. on an 8 core machine if I
$ ./runfibtest 1
Iterations Completed(M): 1886
Throttled for: 51
CPU Usage (msecs) = 507
$ ./runfibtest 8
Iterations Completed(M): 1274
Throttled for: 52
CPU Usage (msecs) = 380

You see that in the 8 core case where we have 7 do nothing threads on
cpu's 1-7, we see only 380 ms of usage, and 52 periods of throttling
when we should have received ~500ms of cpu usage.
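
For readers who do not want to pull the repo, here is a rough,
self-contained sketch of the run/sleep pattern being described.  This is
not the actual fibtest source; the thread count, fib() depths, sleep
interval and run time are illustrative assumptions only.  Run it inside a
cpu cgroup with a small cfs_quota_us and compare cpu.stat and
cpuacct.usage against what was requested:

/* gcc -O2 -pthread fibsketch.c -o fibsketch */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NR_SLOW_THREADS 7	/* "do nothing" threads, one per extra cpu */

static volatile int stop;
static volatile unsigned long sink;	/* keeps fib() from being optimized out */

static unsigned long fib(unsigned long n)
{
	return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

/* One cpu-hungry thread: burns cpu until told to stop. */
static void *fast_thread(void *arg)
{
	while (!stop)
		sink += fib(25);
	return NULL;
}

/* Mostly idle threads: wake briefly every millisecond, then sleep again. */
static void *slow_thread(void *arg)
{
	while (!stop) {
		sink += fib(10);	/* a few microseconds of work */
		usleep(1000);		/* then sleep ~1ms */
	}
	return NULL;
}

int main(void)
{
	pthread_t fast, slow[NR_SLOW_THREADS];
	int i;

	pthread_create(&fast, NULL, fast_thread, NULL);
	for (i = 0; i < NR_SLOW_THREADS; i++)
		pthread_create(&slow[i], NULL, slow_thread, NULL);

	sleep(5);		/* let it run for a few dozen periods */
	stop = 1;

	pthread_join(fast, NULL);
	for (i = 0; i < NR_SLOW_THREADS; i++)
		pthread_join(slow[i], NULL);
	printf("done (sink=%lu)\n", sink);
	return 0;
}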

Looking more closely at the __return_cfs_rq_runtime logic I noticed
        if (cfs_b->quota != RUNTIME_INF &&
            cfs_rq->runtime_expires == cfs_b->runtime_expires) {

Which is awfully similar to the logic that was fixed by 512ac999.  Is it
possible that we are just not ever returning runtime back to the cfs_b
because of the runtime_expires comparison here?

Early on in my testing I created a patch that would report via sysfs the
amount of quota that was expired at the end of a period in
expire_cfs_rq_runtime, and it roughly equalled the difference in quota
between the actual cpu usage and the allotted quota.  That leads me to
believe that something is not going quite right in the slack return
logic __return_cfs_rq_runtime.  I'll attach that patch at the bottom of
this e-mail in case you'd like to reproduce my tests.

> Removing expiration means that in the worst case period and quota can be
> effectively twice what the user specified, but only on very particular
> workloads.
I'm only removing expiration of slices that have already been assigned
to individual cfs_rq.  My understanding is that there is at most one
cfs_rq per cpu, and each of those can have at most one slice of
available runtime.  So the worst case burst is slice_ms * cpus.  Please
help me understand how you get to twice user specified quota and period
as it's not obvious to me (I've only been looking at this for a few
months).

> I think we should at least think about instead lowering
> min_cfs_rq_runtime to some smaller value
Do you mean lower than 1ms?

Thanks
Dave.

From: Dave Chiluk <chiluk+linux@indeed.com>
Date: Thu, 30 May 2019 12:47:12 -0500
Subject: [PATCH] Add expired_time sysfs entry

Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
---
 kernel/sched/core.c  | 41 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c  |  4 ++++
 kernel/sched/sched.h |  1 +
 3 files changed, 46 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..3c06df9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6655,6 +6655,30 @@ static long tg_get_cfs_period(struct task_group *tg)
        return cfs_period_us;
 }

+static int tg_set_cfs_expired_runtime(struct task_group *tg, long slice_expiration)
+{
+       struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+
+       if (slice_expiration == 0)
+       {
+               raw_spin_lock_irq(&cfs_b->lock);
+               cfs_b->expired_runtime = 0;
+               raw_spin_unlock_irq(&cfs_b->lock);
+               return 0;
+       }
+       return 1;
+}
+
+static u64 tg_get_cfs_expired_runtime(struct task_group *tg)
+{
+       u64 expired_runtime;
+
+       expired_runtime = tg->cfs_bandwidth.expired_runtime;
+       do_div(expired_runtime, NSEC_PER_USEC);
+
+       return expired_runtime;
+}
+
 static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
                                  struct cftype *cft)
 {
@@ -6679,6 +6703,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
        return tg_set_cfs_period(css_tg(css), cfs_period_us);
 }

+static u64 cpu_cfs_expired_runtime_read_u64(struct cgroup_subsys_state *css,
+                                 struct cftype *cft)
+{
+       return tg_get_cfs_expired_runtime(css_tg(css));
+}
+
+static int cpu_cfs_expired_runtime_write_s64(struct cgroup_subsys_state *css,
+                                  struct cftype *cftype, s64 cfs_slice_expiration_us)
+{
+       return tg_set_cfs_expired_runtime(css_tg(css), cfs_slice_expiration_us);
+}
+
 struct cfs_schedulable_data {
        struct task_group *tg;
        u64 period, quota;
@@ -6832,6 +6868,11 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
                .write_u64 = cpu_cfs_period_write_u64,
        },
        {
+               .name = "cfs_expired_runtime",
+               .read_u64 = cpu_cfs_expired_runtime_read_u64,
+               .write_s64 = cpu_cfs_expired_runtime_write_s64,
+       },
+       {
                .name = "stat",
                .seq_show = cpu_cfs_stat_show,
        },
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..bcbd4ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4382,6 +4382,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
                cfs_rq->runtime_expires += TICK_NSEC;
        } else {
                /* global deadline is ahead, expiration has passed */
+               raw_spin_lock_irq(&cfs_b->lock);
+               cfs_b->expired_runtime += cfs_rq->runtime_remaining;
+               raw_spin_unlock_irq(&cfs_b->lock);
                cfs_rq->runtime_remaining = 0;
        }
 }
@@ -4943,6 +4946,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
        cfs_b->runtime = 0;
        cfs_b->quota = RUNTIME_INF;
        cfs_b->period = ns_to_ktime(default_cfs_period());
+       cfs_b->expired_runtime = 0;

        INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
        hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..499d2e2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -343,6 +343,7 @@ struct cfs_bandwidth {
        s64                     hierarchical_quota;
        u64                     runtime_expires;
        int                     expires_seq;
+       u64                     expired_runtime;

        short                   idle;
        short                   period_active;

--
1.8.3.1

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-30 17:53       ` Dave Chiluk
@ 2019-05-30 20:44         ` bsegall
  0 siblings, 0 replies; 40+ messages in thread
From: bsegall @ 2019-05-30 20:44 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc, pjt

Dave Chiluk <chiluk+linux@indeed.com> writes:

> On Wed, May 29, 2019 at 02:05:55PM -0700, bsegall@google.com wrote:
>> Dave Chiluk <chiluk+linux@indeed.com> writes:
>>
>> Yeah, having run the test, stranding only 1 ms per cpu rather than 5
>> doesn't help if you only have 10 ms of quota and even 10 threads/cpus.
>> The slack timer isn't important in this test, though I think it probably
>> should be changed.
> My min_cfs_rq_runtime was already set to 1ms.

Yeah, I meant min_cfs_rq_runtime vs the 5ms if the slack stuff was
broken.

>
> Additionally raising the amount of quota from 10ms to 50ms or even
> 100ms, still results in throttling without full quota usage.
>
>> Decreasing min_cfs_rq_runtime helps, but would mean that we have to pull
>> quota more often / always. The worst case here I think is where you
>> run/sleep for ~1ns, so you wind up taking the lock twice every
>> min_cfs_rq_runtime: once for assign and once to return all but min,
>> which you then use up doing short run/sleep. I suppose that determines
>> how much we care about this overhead at all.
> I'm not so concerned about how inefficiently the user-space application
> runs, as that's up to the invidual developer.

Increasing scheduler overhead is something we generally try to prevent;
that is what I was worried about.

> The fibtest testcase, is
> purely my approximation of what a java application with lots of worker
> threads might do, as I didn't have a great deterministic java
> reproducer, and I feared posting java to LKML.  I'm more concerned with
> the fact that the user requested 10ms/period or 100ms/period and they
> hit throttling while simultaneously not seeing that amount of cpu usage.
> i.e. on an 8 core machine if I
> $ ./runfibtest 1
> Iterations Completed(M): 1886
> Throttled for: 51
> CPU Usage (msecs) = 507
> $ ./runfibtest 8
> Iterations Completed(M): 1274
> Throttled for: 52
> CPU Usage (msecs) = 380
>
> You see that in the 8 core case where we have 7 do nothing threads on
> cpu's 1-7, we see only 380 ms of usage, and 52 periods of throttling
> when we should have received ~500ms of cpu usage.
>
> Looking more closely at the __return_cfs_rq_runtime logic I noticed
>         if (cfs_b->quota != RUNTIME_INF &&
>             cfs_rq->runtime_expires == cfs_b->runtime_expires) {
>
> Which is awfully similar to the logic that was fixed by 512ac999.  Is it
> possible that we are just not ever returning runtime back to the cfs_b
> because of the runtime_expires comparison here?

The relevant issue that patch fixes is that the old conditional was
backwards. Also lowering min_cfs_rq_runtime to 0 fixes your testcase, so
it's working.

>
>> Removing expiration means that in the worst case period and quota can be
>> effectively twice what the user specified, but only on very particular
>> workloads.
> I'm only removing expiration of slices that have already been assigned
> to individual cfs_rq.  My understanding is that there is at most one
> cfs_rq per cpu, and each of those can have at most one slice of
> available runtime.  So the worst case burst is slice_ms * cpus.  Please
> help me understand how you get to twice user specified quota and period
> as it's not obvious to me *(I've only been looking at this for a few
> months).

The reason that this effect is so significant is because slice_ms * cpus
is roughly 100% of the quota. So yes, it's roughly the same thing.
Unfortunately, if there are more spare cpus on the system, just doubling
quota and period (keeping the same ratio) would not fix your issue,
while removing expiration does, though it also potentially has that
doubling effect.
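
To put rough numbers on that (assuming the 8-cpu box and the 10ms of
quota per 100ms period used in the runfibtest numbers above, with the
default 5ms sched_cfs_bandwidth_slice and 1ms min_cfs_rq_runtime):

  full slices stranded:  5ms/cpu * 8 cpus = 40ms  (4x the 10ms quota)
  1ms floor stranded:    1ms/cpu * 8 cpus =  8ms  (~80% of the 10ms quota)

Since the stranded amount scales with the number of cpus the threads
touch rather than with the quota, doubling quota and period only moves
the break-even point.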

>
>> I think we should at least think about instead lowering
>> min_cfs_rq_runtime to some smaller value
> Do you mean lower than 1ms?

Yes

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining.
       [not found] ` <1561391404-14450-1-git-send-email-chiluk+linux@indeed.com>
@ 2019-06-24 15:50   ` Dave Chiluk
  2019-06-24 17:33     ` bsegall
  0 siblings, 1 reply; 40+ messages in thread
From: Dave Chiluk @ 2019-06-24 15:50 UTC (permalink / raw)
  To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang

It has been observed, that highly-threaded, user-interactive
applications running under cpu.cfs_quota_us constraints can hit a high
percentage of periods throttled while simultaneously not consuming the
allocated amount of quota. This impacts user-interactive non-cpu bound
applications, such as those running in kubernetes or mesos when run on
multiple cores.

This has been root caused to threads being allocated per cpu bandwidth
slices, and then not fully using that slice within the period. This
results in min_cfs_rq_runtime remaining on each per-cpu cfs_rq. At the
end of the period this remaining quota goes unused and expires. This
expiration of unused time on per-cpu runqueues results in applications
under-utilizing their quota while simultaneously hitting throttling.

The solution is to return all spare cfs_rq->runtime_remaining when
cfs_b->runtime nears the sched_cfs_bandwidth_slice. This balances the
desire to prevent cfs_rq from always pulling quota with the desire to
allow applications to fully utilize their quota.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
---
 kernel/sched/fair.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..4894eda 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4695,7 +4695,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	return 1;
 }
 
-/* a cfs_rq won't donate quota below this amount */
+/* a cfs_rq won't donate quota below this amount unless cfs_b has very little
+ * remaining runtime.
+ */
 static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
 /* minimum remaining period time to redistribute slack quota */
 static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
@@ -4743,16 +4745,27 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
 static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
+	s64 slack_runtime = cfs_rq->runtime_remaining;
 
+	/* There is no runtime to return. */
 	if (slack_runtime <= 0)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
 	if (cfs_b->quota != RUNTIME_INF &&
 	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
-		cfs_b->runtime += slack_runtime;
+		/* As we near 0 quota remaining on cfs_b start returning all
+		 * remaining runtime. This avoids stranding and then expiring
+		 * runtime on per-cpu cfs_rq.
+		 *
+		 * If cfs_b has plenty of runtime, leave min_cfs_rq_runtime of
+		 * runtime on this cfs_rq.
+		 */
+		if (cfs_b->runtime >= sched_cfs_bandwidth_slice() * 3 &&
+		    slack_runtime > min_cfs_rq_runtime)
+			slack_runtime -= min_cfs_rq_runtime;
 
+		cfs_b->runtime += slack_runtime;
 		/* we are under rq->lock, defer unthrottling using a timer */
 		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
 		    !list_empty(&cfs_b->throttled_cfs_rq))
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining.
  2019-06-24 15:50   ` [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining Dave Chiluk
@ 2019-06-24 17:33     ` bsegall
  2019-06-26 22:10       ` Dave Chiluk
  0 siblings, 1 reply; 40+ messages in thread
From: bsegall @ 2019-06-24 17:33 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang

Dave Chiluk <chiluk+linux@indeed.com> writes:

> It has been observed, that highly-threaded, user-interactive
> applications running under cpu.cfs_quota_us constraints can hit a high
> percentage of periods throttled while simultaneously not consuming the
> allocated amount of quota. This impacts user-interactive non-cpu bound
> applications, such as those running in kubernetes or mesos when run on
> multiple cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. This
> results in min_cfs_rq_runtime remaining on each per-cpu cfs_rq. At the
> end of the period this remaining quota goes unused and expires. This
> expiration of unused time on per-cpu runqueues results in applications
> under-utilizing their quota while simultaneously hitting throttling.
>
> The solution is to return all spare cfs_rq->runtime_remaining when
> cfs_b->runtime nears the sched_cfs_bandwidth_slice. This balances the
> desire to prevent cfs_rq from always pulling quota with the desire to
> allow applications to fully utilize their quota.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> ---
>  kernel/sched/fair.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..4894eda 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4695,7 +4695,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>  	return 1;
>  }
>  
> -/* a cfs_rq won't donate quota below this amount */
> +/* a cfs_rq won't donate quota below this amount unless cfs_b has very little
> + * remaining runtime.
> + */
>  static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
>  /* minimum remaining period time to redistribute slack quota */
>  static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
> @@ -4743,16 +4745,27 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
>  static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  {
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
> +	s64 slack_runtime = cfs_rq->runtime_remaining;
>  
> +	/* There is no runtime to return. */
>  	if (slack_runtime <= 0)
>  		return;
>  
>  	raw_spin_lock(&cfs_b->lock);
>  	if (cfs_b->quota != RUNTIME_INF &&
>  	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> -		cfs_b->runtime += slack_runtime;
> +		/* As we near 0 quota remaining on cfs_b start returning all
> +		 * remaining runtime. This avoids stranding and then expiring
> +		 * runtime on per-cpu cfs_rq.
> +		 *
> +		 * cfs->b has plenty of runtime leave min_cfs_rq_runtime of
> +		 * runtime on this cfs_rq.
> +		 */
> +		if (cfs_b->runtime >= sched_cfs_bandwidth_slice() * 3 &&
> +		    slack_runtime > min_cfs_rq_runtime)
> +			slack_runtime -= min_cfs_rq_runtime;
>  
> +		cfs_b->runtime += slack_runtime;
>  		/* we are under rq->lock, defer unthrottling using a timer */
>  		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
>  		    !list_empty(&cfs_b->throttled_cfs_rq))


This still has a similar cost as reducing min_cfs_rq_runtime to 0 - we
now take a tg-global lock on every group se dequeue. Setting min=0 means
that we have to take it on both enqueue and dequeue, while baseline
means we take it once per min_cfs_rq_runtime in the worst case.

In addition how much this helps is very dependent on the exact pattern
of sleep/wake - you can still strand all but 15ms of runtime with a
pretty reasonable pattern.

If the cost of taking this global lock across all cpus without a
ratelimit was somehow not a problem, I'd much prefer to just set
min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
and sorta have 2x period 2x runtime" solution of removing expiration)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining.
  2019-06-24 17:33     ` bsegall
@ 2019-06-26 22:10       ` Dave Chiluk
  2019-06-27 20:18         ` bsegall
  0 siblings, 1 reply; 40+ messages in thread
From: Dave Chiluk @ 2019-06-26 22:10 UTC (permalink / raw)
  To: Ben Segall
  Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang

On Mon, Jun 24, 2019 at 10:33:07AM -0700, bsegall@google.com wrote:
> This still has a similar cost as reducing min_cfs_rq_runtime to 0 - we
> now take a tg-global lock on every group se dequeue. Setting min=0 means
> that we have to take it on both enqueue and dequeue, while baseline
> means we take it once per min_cfs_rq_runtime in the worst case.
Yep, it's only slightly better than simply setting min_cfs_rq_runtime=0.
There's definitely a tradeoff of having to grab the lock every time.

The other non-obvious benefit to this is that when the application
starts hitting throttling, the entire application hits throttling
closer to simultaneously.  Previously, threads that don't do
much could continue sipping on their 1ms of min_cfs_rq_runtime, while
main threads were throttled.  With this the application would more or
less halt within 5ms of full quota usage.

> In addition how much this helps is very dependent on the exact pattern
> of sleep/wake - you can still strand all but 15ms of runtime with a
> pretty reasonable pattern.

I thought this change would have an upper bound stranded time of:
NUMCPUs * min_cfs_rq_runtime - 3 * sched_cfs_bandwidth_slice().
That comes from the stranding of 1ms on each queue minus the amount that we'd
return when we forcibly hit the 3 x bandwidth slice threshold. Am I missing
something here? Additionally that's the worst case, and would require that
threads be scheduled on distinct cpus mostly sequentially, and then
never run again.
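
Plugging the default tunables into that formula (1ms min_cfs_rq_runtime,
5ms sched_cfs_bandwidth_slice), and using the 80-cpu machine mentioned
elsewhere in this thread purely as an example:

  NUMCPUs * min_cfs_rq_runtime - 3 * sched_cfs_bandwidth_slice()
    = 80 * 1ms - 3 * 5ms
    = 65ms of worst-case stranded runtime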

> If the cost of taking this global lock across all cpus without a
> ratelimit was somehow not a problem, I'd much prefer to just set
> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
> and sorta have 2x period 2x runtime" solution of removing expiration)

Can I take this as a technical ack of the v3 patchset?  Do you have any
other comments for that patchset you'd like to see corrected before it
gets integrated?  If you are indeed good with that patchset, I'll clean
up the comment and Documentation text and re-submit as a v5 patchset.
I actually like that patchset best as well, for all the reasons already
discussed.  Especially the one where it's been running like this for 5
years prior to 512ac999.

Also given my new-found understanding of min_cfs_rq_runtime, doesn't
this mean that if we don't expire runtime we could potentially have a
worst case of 1ms * NUMCPUs of extra runtime instead of 2x runtime?
This is because min_cfs_rq_runtime could theoretically live on a per-cpu
runqueue indefinitely?  Still, this is actually much smaller than I
previously had feared.  It also still holds that usage is bounded by
runtime/period on average.

Thanks

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-17 19:30 [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices Dave Chiluk
                   ` (2 preceding siblings ...)
       [not found] ` <1561391404-14450-1-git-send-email-chiluk+linux@indeed.com>
@ 2019-06-27 19:09 ` Dave Chiluk
  2019-06-27 19:49 ` [PATCH v5 0/1] " Dave Chiluk
  2019-07-23 16:44 ` [PATCH v6 0/1] " Dave Chiluk
  5 siblings, 0 replies; 40+ messages in thread
From: Dave Chiluk @ 2019-06-27 19:09 UTC (permalink / raw)
  To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to threads being allocated per cpu bandwidth
slices, and then not fully using that slice within the period. At which
point the slice and quota expires. This expiration of unused slice
results in applications not being able to utilize the quota for which
they are allocated.

The expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this has been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevents the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
---
 Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++-------
 kernel/sched/fair.c                   | 71 +++-------------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 65 insertions(+), 83 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4c19c94 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.
 
 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary.  As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis.  The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
 Management
@@ -33,12 +34,12 @@ The default values are:
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group.  This represents the traditional work-conserving behavior for
+bandwidth group. This represents the traditional work-conserving behavior for
 CFS.
 
 Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms.  There is also an
-upper bound on the period length of 1s.  Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
 bandwidth limits are used in a hierarchical fashion, these are explained in
 more detail below.
 
@@ -51,8 +52,8 @@ unthrottled if it is in a constrained state.
 System wide settings
 --------------------
 For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion.  This greatly reduces global accounting pressure
-on large systems.  The amount transferred each time such an update is required
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
 is described as the "slice".
 
 This is tunable via procfs:
@@ -90,6 +91,50 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
 
+Real-world behavior
+-------------------
+Once a slice is assigned to a cpu it does not expire, but all except 1ms of it
+may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu-constrained applications that are also cpu-bound this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods will roughly equal nr_throttled, and that
+cpuacct.usage will increase by roughly cfs_quota_us in each period.
+
+However, in the worst case of highly-threaded, interactive/non-cpu bound
+applications, this non-expiration nuance allows them to briefly burst
+past their quota limits by the amount of unused slice on each cpu that the task
+group is running on (typically at most 1ms per cpu). This slight burst
+only applies if quota had been assigned to a cpu and then not fully used or
+returned in previous periods. This burst amount will not be transferred
+between cores. As a result, this mechanism still strictly limits the task
+group to quota average usage, albeit over a longer time window than period.
+This also limits the burst ability to no more than 1ms per cpu. This provides
+a better, more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultaneously using less than
+quota amounts of cpu. Another way to say this is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wasting quota on cpu-local silos that don't need a full slice's
+amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.txt) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
 Examples
 --------
 1. Limit a group to 1 CPU worth of runtime.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 	unsigned long flags;
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	if (expires == cfs_b->runtime_expires)
-		lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;
 
 	short			idle;
 	short			period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;
 
 	u64			throttled_clock;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v5 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-17 19:30 [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices Dave Chiluk
                   ` (3 preceding siblings ...)
  2019-06-27 19:09 ` [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk
@ 2019-06-27 19:49 ` Dave Chiluk
  2019-06-27 19:49   ` [PATCH v5 1/1] " Dave Chiluk
  2019-07-23 16:44 ` [PATCH v6 0/1] " Dave Chiluk
  5 siblings, 1 reply; 40+ messages in thread
From: Dave Chiluk @ 2019-06-27 19:49 UTC (permalink / raw)
  To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc

Changelog v5
- Based on this comment from Ben Segall on v4
> If the cost of taking this global lock across all cpus without a
> ratelimit was somehow not a problem, I'd much prefer to just set
> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
> and sorta have 2x period 2x runtime" solution of removing expiration)
I'm resubmitting my v3 patchset, with the requested changes.
- Updated Commit log given review comments
- Update sched-bwc.txt to reflect my new understanding of the slack timer.

Changelog v4
- Rewrote patchset around the concept of returning all of runtime_remaining
when cfs_b nears the end of available quota.

Changelog v3
- Reworked documentation to better describe behavior of slice expiration per
feedback from Peter Oskolkov

Changelog v2
- Fixed some checkpatch errors in the commit message.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-06-27 19:49 ` [PATCH v5 0/1] " Dave Chiluk
@ 2019-06-27 19:49   ` Dave Chiluk
  2019-07-01 20:15     ` bsegall
  2019-07-16 19:58     ` bsegall
  0 siblings, 2 replies; 40+ messages in thread
From: Dave Chiluk @ 2019-06-27 19:49 UTC (permalink / raw)
  To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to threads being allocated per cpu bandwidth
slices, and then not fully using that slice within the period. At which
point the slice and quota expires. This expiration of unused slice
results in applications not being able to utilize the quota for which
they are allocated.

The expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this has been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.
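
The essential shape of that workload can be sketched in a few lines of C
(illustration only, not the fibtest program itself; the thread count, burn
times and sleep times below are arbitrary):

#include <pthread.h>
#include <time.h>
#include <unistd.h>

#define SLOW_THREADS 7			/* roughly one per spare cpu; arbitrary */

static volatile unsigned long sink;

/* busy-loop for approximately 'us' microseconds */
static void burn_us(long us)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		sink++;
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000000L +
		 (now.tv_nsec - start.tv_nsec) / 1000L < us);
}

/* wake briefly, then sleep: uses far less than a 1ms cpu-local slice */
static void *slow_thread(void *arg)
{
	(void)arg;
	for (;;) {
		burn_us(100);
		usleep(10 * 1000);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;
	int i;

	for (i = 0; i < SLOW_THREADS; i++)
		pthread_create(&tid, NULL, slow_thread, NULL);

	for (;;)			/* the cpu-bound "fast" thread */
		burn_us(1000);
}

Built with 'gcc -O2 -pthread' and run inside a cgroup with a small quota
(say cpu.cfs_quota_us=10000, cpu.cfs_period_us=100000) on a many-core
machine, a kernel that still expires cpu-local slices should show
nr_throttled tracking nr_periods in cpu.stat even though cpuacct.usage
stays well below the allotted quota.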

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
---
 Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++-------
 kernel/sched/fair.c                   | 71 +++-------------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 65 insertions(+), 83 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4c19c94 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.
 
 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary.  As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis.  The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
 Management
@@ -33,12 +34,12 @@ The default values are:
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group.  This represents the traditional work-conserving behavior for
+bandwidth group. This represents the traditional work-conserving behavior for
 CFS.
 
 Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms.  There is also an
-upper bound on the period length of 1s.  Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
 bandwidth limits are used in a hierarchical fashion, these are explained in
 more detail below.
 
@@ -51,8 +52,8 @@ unthrottled if it is in a constrained state.
 System wide settings
 --------------------
 For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion.  This greatly reduces global accounting pressure
-on large systems.  The amount transferred each time such an update is required
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
 is described as the "slice".
 
 This is tunable via procfs:
@@ -90,6 +91,50 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
 
+Real-world behavior
+-------------------
+Once a slice is assigned to a cpu it does not expire, but all except 1ms of it
+may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For applications that are genuinely cpu-bound under their cgroup cpu quota,
+this is a relatively moot point because they will naturally consume the
+entirety of their quota as well as the entirety of each cpu-local slice in
+each period. As a result, nr_periods is expected to roughly equal
+nr_throttled, and cpuacct.usage to increase by about cfs_quota_us each period.
+
+However, in the worst case of highly-threaded, interactive/non-cpu-bound
+applications, this non-expiration nuance allows applications to briefly burst
+past their quota limits by the amount of unused slice on each cpu that the task
+group is running on (typically at most 1ms per cpu). This slight burst only
+applies if quota had been assigned to a cpu and then not fully used or
+returned in previous periods. This burst amount will not be transferred
+between cores. As a result, this mechanism still strictly limits the task
+group to its quota on average, albeit over a longer time window than a period.
+The burst is likewise capped at no more than 1ms per cpu. This provides a
+better, more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while they simultaneously use less
+than their quota's worth of cpu. Another way to say this is that by allowing
+the unused portion of a slice to remain valid across periods we have decreased
+the possibility of wasting quota on cpu-local silos that don't need a full
+slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.txt) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
 Examples
 --------
 1. Limit a group to 1 CPU worth of runtime.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 	unsigned long flags;
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	if (expires == cfs_b->runtime_expires)
-		lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;
 
 	short			idle;
 	short			period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;
 
 	u64			throttled_clock;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining.
  2019-06-26 22:10       ` Dave Chiluk
@ 2019-06-27 20:18         ` bsegall
  0 siblings, 0 replies; 40+ messages in thread
From: bsegall @ 2019-06-27 20:18 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang

Dave Chiluk <chiluk+linux@indeed.com> writes:

> On Mon, Jun 24, 2019 at 10:33:07AM -0700, bsegall@google.com wrote:
>> This still has a similar cost as reducing min_cfs_rq_runtime to 0 - we
>> now take a tg-global lock on every group se dequeue. Setting min=0 means
>> that we have to take it on both enqueue and dequeue, while baseline
>> means we take it once per min_cfs_rq_runtime in the worst case.
> Yep, it's only slightly better than simply setting min_cfs_rq_runtime=0.
> There's definitely a tradeoff of having to grab the lock every time.
>
> The other non-obvious benefit to this is that when the application
> starts hitting throttling the entire application starts hitting
> throttling closer to simultaneously.  Previously, threads that don't do
> much could continue sipping on their 1ms of min_cfs_rq_runtime, while
> main threads were throttled.  With this the application would more or
> less halt within 5ms of full quota usage.
>
>> In addition how much this helps is very dependent on the exact pattern
>> of sleep/wake - you can still strand all but 15ms of runtime with a
>> pretty reasonable pattern.
>
> I thought this change would have an upper bound stranded time of:
> NUMCPUs * min_cfs_rq_runtime - 3 * sched_cfs_bandwidth_slice().
> From the stranding of 1ms on each queue minues the amount that we'd
> return when we forcibly hit the 3 x bandwidth slice Am I missing
> something here? Additionally that's worst case, and would require that
> threads be scheduled on distinct cpus mostly sequentially, and then
> never run again.

Yeah, to be exact it's that. It's just that this is only relevant at all
when NUMCPUs*min_cfs_rq_runtime is at least a significant fraction of
quota, and when "scheduled on distinct cpus, and then never run again"
happens. How sequential it needs to be depends on the threads vs num
cpus.
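
For a sense of scale under the v4 scheme (illustrative numbers, not
measurements; assuming the default 5ms sched_cfs_bandwidth_slice and
min_cfs_rq_runtime=1ms on the 80-cpu machine mentioned earlier):

	80 cpus * 1ms min_cfs_rq_runtime = 80ms potentially stranded
	3 * 5ms bandwidth slice          = 15ms returned as cfs_b nears empty
	worst case stranded              = 80ms - 15ms = 65ms per period

which matters for something like a 50ms/100ms quota and is negligible once
quota is many times larger than NUMCPUs * 1ms.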


>
>> If the cost of taking this global lock across all cpus without a
>> ratelimit was somehow not a problem, I'd much prefer to just set
>> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
>> and sorta have 2x period 2x runtime" solution of removing expiration)
>
> Can I take this as an technical ack of the v3 patchset?  Do you have any
> other comments for that patchset you'd like to see corrected before it
> gets integrated?  If you are indeed good with that patchset, I'll clean
> up the comment and Documentation text and re-submit as a v5 patchset. I
>
> actually like that patchset best as well, for all the reasons already
> discussed.  Especially the one where it's been running like this for 5
> years prior to 512ac999.
>
> Also given my new-found understanding of min_cfs_rq_runtime, doesn't
> this mean that if we don't expire runtime we could potentially have a
> worst case of 1ms * NUMCPUs of extra runtime instead of 2x runtime?
> This is because min_cfs_rq_runtime could theoretically live on a per-cpu
> indefinitely?  Still this is actually much smaller than I previously had
> feared.  Also it still holds that average runtime is still bounded by
> runtime/period on average.

Yeah, I think I was misremembering or thinking only about one example or
something. Meanwhile I'm thinking this is potentially /worse/ than just
2x runtime per 2x period, though obviously it depends on cpus vs quota.


One thing I'm trying now is if we made the slack timer go and collect
all the min_cfs_rq_runtime left behind, which would provide something
similar to the min=0 case for fixing not only this but other cases of
stranding bandwidth, but also some level of rate limiting.



>
> Thanks

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-06-27 19:49   ` [PATCH v5 1/1] " Dave Chiluk
@ 2019-07-01 20:15     ` bsegall
  2019-07-11  9:51       ` Peter Zijlstra
  2019-07-16 19:58     ` bsegall
  1 sibling, 1 reply; 40+ messages in thread
From: bsegall @ 2019-07-01 20:15 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Paul Turner

Alright, this prototype of "maybe we should just 100% avoid stranding
runtime for a full period" appears to fix the fibtest synthetic example,
and seems like a theoretically-reasonable approach.

Things that may want improvement or at least thought (but it's a holiday
week in the US and I wanted any opinions on the basic approach):

- I don't like cfs_rq->slack_list, since logically it's mutually
  exclusive with throttled_list, but we have to iterate without
  locks, so I don't know that we can avoid it.

- I previously was using _rcu for the slack list, like throttled, but
  there is no list_for_each_entry_rcu_safe, so the list_del_init would
  be invalid and we'd have to use another flag or opencode the
  equivalent.

- (Actually, this just made me realize that distribute is sorta wrong if
  the unthrottled task immediately runs and rethrottles; this would just
  mean that we effectively restart the loop)

- We unconditionally start the slack timer, even if nothing is
  throttled. We could instead have throttle start the timer in this case
  (setting the timeout to some definition of "appropriately"), but this
  bookkeeping would be a big hassle.

- We could try to do better about deciding what cfs_rqs are idle than
  "nr_running == 0", possibly requiring that to have been true for N<5
  ms, and restarting the slack timer if we didn't clear everything.

- Taking up-to-every rq->lock is bad and expensive and 5ms may be too
  short a delay for this. I haven't tried microbenchmarks on the cost of
  this vs min_cfs_rq_runtime = 0 vs baseline.

- runtime_expires vs expires_seq choice is basically rand(), much like
  the existing code. (I think the most consistent with existing code
  would be runtime_expires, since cfs_b lock is held; I think most
  consistent in general would change most of the existing ones as well
  to be seq)


-- >8 --
Subject: [PATCH] sched: avoid stranding cfs_bandwidth runtime

We avoid contention on the per-tg cfs_b->lock by keeping 1ms of runtime on a
cfs_rq even when all tasks in that cfs_rq dequeue. This way tasks doing frequent
wake/sleep can't hit this cross-cpu lock more than once per ms. This however
means that up to 1ms of runtime per cpu can be lost if no task wakes up on
that cpu, which leads to issues on cgroups with low quota, many available
cpus, and a combination of threads that run for very little time and ones that
want to run constantly.

This was previously hidden by runtime expiration being broken, which allowed
this stranded runtime to be kept indefinitely across period resets. Thus after
an initial period or two any long-running tasks could use an appropriate portion
of their group's quota. The issue was that the group could also potentially
burst for 1ms * cpus more than their quota allowed, and in these situations this
is a significant increase.

Fix this by having the group's slack timer (which runs at most once per 5ms)
remove all runtime from empty cfs_rqs, not just redistribute any runtime above
that 1ms that was returned immediately.

Signed-off-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c  | 66 +++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h |  2 ++
 2 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 300f2c54dea5..80b2198d9b29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4745,23 +4745,32 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
 
-	if (slack_runtime <= 0)
+	if (cfs_rq->runtime_remaining <= 0)
+		return;
+
+	if (slack_runtime <= 0 && !list_empty(&cfs_rq->slack_list))
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
 	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
-		cfs_b->runtime += slack_runtime;
+	    cfs_rq->expires_seq == cfs_b->expires_seq) {
+		if (slack_runtime > 0)
+			cfs_b->runtime += slack_runtime;
+		if (list_empty(&cfs_rq->slack_list))
+			list_add(&cfs_rq->slack_list, &cfs_b->slack_cfs_rq);
 
-		/* we are under rq->lock, defer unthrottling using a timer */
-		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
-		    !list_empty(&cfs_b->throttled_cfs_rq))
-			start_cfs_slack_bandwidth(cfs_b);
+		/*
+		 * After a timeout, gather our remaining runtime so it can't get
+		 * stranded. We need a timer anyways to distribute any of the
+		 * runtime due to locking issues.
+		 */
+		start_cfs_slack_bandwidth(cfs_b);
 	}
 	raw_spin_unlock(&cfs_b->lock);
 
-	/* even if it's not valid for return we don't want to try again */
-	cfs_rq->runtime_remaining -= slack_runtime;
+	if (slack_runtime > 0)
+		/* even if it's not valid for return we don't want to try again */
+		cfs_rq->runtime_remaining -= slack_runtime;
 }
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
  */
 static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
-	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	u64 runtime = 0;
 	unsigned long flags;
 	u64 expires;
+	struct cfs_rq *cfs_rq, *temp;
+	LIST_HEAD(temp_head);
+
+	local_irq_save(flags);
+
+	raw_spin_lock(&cfs_b->lock);
+	cfs_b->slack_started = false;
+	list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
+	raw_spin_unlock(&cfs_b->lock);
+
+
+	/* Gather all left over runtime from all rqs */
+	list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
+		struct rq *rq = rq_of(cfs_rq);
+		struct rq_flags rf;
+
+		rq_lock(rq, &rf);
+
+		raw_spin_lock(&cfs_b->lock);
+		list_del_init(&cfs_rq->slack_list);
+		if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
+		    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+			cfs_b->runtime += cfs_rq->runtime_remaining;
+			cfs_rq->runtime_remaining = 0;
+		}
+		raw_spin_unlock(&cfs_b->lock);
+
+		rq_unlock(rq, &rf);
+	}
 
 	/* confirm we're still not at a refresh boundary */
-	raw_spin_lock_irqsave(&cfs_b->lock, flags);
+	raw_spin_lock(&cfs_b->lock);
 	cfs_b->slack_started = false;
 	if (cfs_b->distribute_running) {
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
@@ -4798,7 +4836,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 		return;
 	}
 
-	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
+	if (cfs_b->quota != RUNTIME_INF)
 		runtime = cfs_b->runtime;
 
 	expires = cfs_b->runtime_expires;
@@ -4946,6 +4984,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
+	INIT_LIST_HEAD(&cfs_b->slack_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
@@ -4958,6 +4997,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->runtime_enabled = 0;
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
+	INIT_LIST_HEAD(&cfs_rq->slack_list);
 }
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b08dee29ef5e..3b272ee894fb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -345,6 +345,7 @@ struct cfs_bandwidth {
 	struct hrtimer		period_timer;
 	struct hrtimer		slack_timer;
 	struct list_head	throttled_cfs_rq;
+	struct list_head	slack_cfs_rq;
 
 	/* Statistics: */
 	int			nr_periods;
@@ -566,6 +567,7 @@ struct cfs_rq {
 	int			throttled;
 	int			throttle_count;
 	struct list_head	throttled_list;
+	struct list_head	slack_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
-- 
2.22.0.410.gd8fdbe21b5-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-07-01 20:15     ` bsegall
@ 2019-07-11  9:51       ` Peter Zijlstra
  2019-07-11 17:46         ` bsegall
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2019-07-11  9:51 UTC (permalink / raw)
  To: bsegall
  Cc: Dave Chiluk, Pqhil Auld, Peter Oskolkov, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Paul Turner


FWIW, good to see progress, still waiting for you guys to agree :-)

On Mon, Jul 01, 2019 at 01:15:44PM -0700, bsegall@google.com wrote:

> - Taking up-to-every rq->lock is bad and expensive and 5ms may be too
>   short a delay for this. I haven't tried microbenchmarks on the cost of
>   this vs min_cfs_rq_runtime = 0 vs baseline.

Yes, that's tricky, SGI/HPE have definite ideas about that.

> @@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>   */
>  static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  {
> -	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> +	u64 runtime = 0;
>  	unsigned long flags;
>  	u64 expires;
> +	struct cfs_rq *cfs_rq, *temp;
> +	LIST_HEAD(temp_head);
> +
> +	local_irq_save(flags);
> +
> +	raw_spin_lock(&cfs_b->lock);
> +	cfs_b->slack_started = false;
> +	list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
> +	raw_spin_unlock(&cfs_b->lock);
> +
> +
> +	/* Gather all left over runtime from all rqs */
> +	list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
> +		struct rq *rq = rq_of(cfs_rq);
> +		struct rq_flags rf;
> +
> +		rq_lock(rq, &rf);
> +
> +		raw_spin_lock(&cfs_b->lock);
> +		list_del_init(&cfs_rq->slack_list);
> +		if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
> +		    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> +			cfs_b->runtime += cfs_rq->runtime_remaining;
> +			cfs_rq->runtime_remaining = 0;
> +		}
> +		raw_spin_unlock(&cfs_b->lock);
> +
> +		rq_unlock(rq, &rf);
> +	}

But worse still, you take possibly every rq->lock without ever
re-enabling IRQs.

>  
>  	/* confirm we're still not at a refresh boundary */
> -	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> +	raw_spin_lock(&cfs_b->lock);
>  	cfs_b->slack_started = false;
>  	if (cfs_b->distribute_running) {
>  		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-07-11  9:51       ` Peter Zijlstra
@ 2019-07-11 17:46         ` bsegall
       [not found]           ` <CAC=E7cV4sO50NpYOZ06n_BkZTcBqf1KQp83prc+oave3ircBrw@mail.gmail.com>
  0 siblings, 1 reply; 40+ messages in thread
From: bsegall @ 2019-07-11 17:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Chiluk, Pqhil Auld, Peter Oskolkov, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Paul Turner

Peter Zijlstra <peterz@infradead.org> writes:

> FWIW, good to see progress, still waiting for you guys to agree :-)
>
> On Mon, Jul 01, 2019 at 01:15:44PM -0700, bsegall@google.com wrote:
>
>> - Taking up-to-every rq->lock is bad and expensive and 5ms may be too
>>   short a delay for this. I haven't tried microbenchmarks on the cost of
>>   this vs min_cfs_rq_runtime = 0 vs baseline.
>
> Yes, that's tricky, SGI/HPE have definite ideas about that.
>
>> @@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>   */
>>  static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>>  {
>> -	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
>> +	u64 runtime = 0;
>>  	unsigned long flags;
>>  	u64 expires;
>> +	struct cfs_rq *cfs_rq, *temp;
>> +	LIST_HEAD(temp_head);
>> +
>> +	local_irq_save(flags);
>> +
>> +	raw_spin_lock(&cfs_b->lock);
>> +	cfs_b->slack_started = false;
>> +	list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
>> +	raw_spin_unlock(&cfs_b->lock);
>> +
>> +
>> +	/* Gather all left over runtime from all rqs */
>> +	list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
>> +		struct rq *rq = rq_of(cfs_rq);
>> +		struct rq_flags rf;
>> +
>> +		rq_lock(rq, &rf);
>> +
>> +		raw_spin_lock(&cfs_b->lock);
>> +		list_del_init(&cfs_rq->slack_list);
>> +		if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
>> +		    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
>> +			cfs_b->runtime += cfs_rq->runtime_remaining;
>> +			cfs_rq->runtime_remaining = 0;
>> +		}
>> +		raw_spin_unlock(&cfs_b->lock);
>> +
>> +		rq_unlock(rq, &rf);
>> +	}
>
> But worse still, you take possibly every rq->lock without ever
> re-enabling IRQs.
>

Yeah, I'm not sure why I did that, it isn't correctness.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
       [not found]           ` <CAC=E7cV4sO50NpYOZ06n_BkZTcBqf1KQp83prc+oave3ircBrw@mail.gmail.com>
@ 2019-07-12 18:01             ` bsegall
  2019-07-12 22:09             ` bsegall
  1 sibling, 0 replies; 40+ messages in thread
From: bsegall @ 2019-07-12 18:01 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Peter Zijlstra, Phil Auld, Peter Oskolkov, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc, Paul Turner

Dave Chiluk <chiluk+linux@indeed.com> writes:

> So I spent some more time testing this new patch as is *(interrupts disabled).  I know I probably should have fixed the patch, but it's hard to get time on big test hardware sometimes, and I was already well along my way with testing.
>
> In regards to the quota usage overage I was seeing earlier: I have a
> theory as to what might be happening here, and I'm pretty sure it's
> related to the IRQs being disabled during the rq->lock walk. I think
> that the main fast thread was able to use an excess amount of quota
> because the timer interrupt meant to stop it wasn't being handled
> timely due to the interrupts being disabled. On my 8 core machine this
> resulted in a what looked like simply improved usage of the quota, but
> when I ran the test on an 80 core machine I saw a massive overage of
> cpu usage when running fibtest. Specifically when running fibtest for
> 5 seconds with 50ms quota/100ms period expecting ~2500ms of quota
> usage; I got 3731 ms of cpu usage which was an unexpected overage of
> 1231ms. Is that a reasonable theory?

That doesn't seem likely - taking 1ms would be way longer than I'd expect
to begin with, and runtime_remaining can go negative for that sort of
reason anyways assuming the irq time is even counted towards the task.
Also I don't think that the enable-irqs version will help for the scheduler
tick at least without rt patchsets.

That is still also too much for what I was thinking of though. I'll have
to look into this more.

>
> I'll try to get some time again tomorrow to test with IRQs disabled before the walk.  Ben if you have a chance to fix and resend the patch that'd help.
>
> I'm really starting to think that simply removing the quota expiration
> may be the best solution here.  Mathmatically it works out, it makes
> the code simpler, it doesn't have any of the lock walk issues, it
> doesn't add extra latency or overhead due to the slack timer,

It works out _for the job that is supposed to be throttled_. If the job
then gets a burst of actually-expensive work on many threads it can then
use NCPUs extra ms, adding latency to any other job on the system. Given
that it's still only 1ms on each runqueue, maybe this isn't the end of
the world, but the fail case does exist.
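
For scale (illustrative numbers, not measurements): with min_cfs_rq_runtime
at 1ms on an 80-cpu machine and a 50ms/100ms quota, the worst single period
could see 50ms + 80 * 1ms = 130ms of cpu time, the extra 80ms being quota
that was handed out but left unused on those cpus in earlier periods. The
long-run average stays bounded by quota/period; the cost is that one-period
spike landing as latency on whatever else is running.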

(We have to do exactly the same locking stuff on distribute, both more
rarely on the period timer, and on the currently existing slack timer)

> and that behavior is exactly what the kernel was doing for 5 years with few complaints about overage afaik.
>
> Either way, I'm very glad that we are getting to the end of this one, and all solutions appear to solve the core of the problem.  I thank you all the work you guys have put into this.
>
> On Thu, Jul 11, 2019 at 12:46 PM <bsegall@google.com> wrote:
>
>  Peter Zijlstra <peterz@infradead.org> writes:
>
>  > FWIW, good to see progress, still waiting for you guys to agree :-)
>  >
>  > On Mon, Jul 01, 2019 at 01:15:44PM -0700, bsegall@google.com wrote:
>  >
>  >> - Taking up-to-every rq->lock is bad and expensive and 5ms may be too
>  >>   short a delay for this. I haven't tried microbenchmarks on the cost of
>  >>   this vs min_cfs_rq_runtime = 0 vs baseline.
>  >
>  > Yes, that's tricky, SGI/HPE have definite ideas about that.
>  >
>  >> @@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  >>   */
>  >>  static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  >>  {
>  >> -    u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
>  >> +    u64 runtime = 0;
>  >>      unsigned long flags;
>  >>      u64 expires;
>  >> +    struct cfs_rq *cfs_rq, *temp;
>  >> +    LIST_HEAD(temp_head);
>  >> +
>  >> +    local_irq_save(flags);
>  >> +
>  >> +    raw_spin_lock(&cfs_b->lock);
>  >> +    cfs_b->slack_started = false;
>  >> +    list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
>  >> +    raw_spin_unlock(&cfs_b->lock);
>  >> +
>  >> +
>  >> +    /* Gather all left over runtime from all rqs */
>  >> +    list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
>  >> +            struct rq *rq = rq_of(cfs_rq);
>  >> +            struct rq_flags rf;
>  >> +
>  >> +            rq_lock(rq, &rf);
>  >> +
>  >> +            raw_spin_lock(&cfs_b->lock);
>  >> +            list_del_init(&cfs_rq->slack_list);
>  >> +            if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
>  >> +                cfs_rq->runtime_expires == cfs_b->runtime_expires) {
>  >> +                    cfs_b->runtime += cfs_rq->runtime_remaining;
>  >> +                    cfs_rq->runtime_remaining = 0;
>  >> +            }
>  >> +            raw_spin_unlock(&cfs_b->lock);
>  >> +
>  >> +            rq_unlock(rq, &rf);
>  >> +    }
>  >
>  > But worse still, you take possibly every rq->lock without ever
>  > re-enabling IRQs.
>  >
>
>  Yeah, I'm not sure why I did that, it isn't correctness.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
       [not found]           ` <CAC=E7cV4sO50NpYOZ06n_BkZTcBqf1KQp83prc+oave3ircBrw@mail.gmail.com>
  2019-07-12 18:01             ` bsegall
@ 2019-07-12 22:09             ` bsegall
  2019-07-15 15:44               ` Dave Chiluk
  1 sibling, 1 reply; 40+ messages in thread
From: bsegall @ 2019-07-12 22:09 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Peter Zijlstra, Phil Auld, Peter Oskolkov, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc, Paul Turner

Dave Chiluk <chiluk+linux@indeed.com> writes:

> So I spent some more time testing this new patch as is *(interrupts disabled).  I know I probably should have fixed the patch, but it's hard to get time on big test hardware sometimes, and I was already well along my way with testing.
>
> In regards to the quota usage overage I was seeing earlier: I have a theory as to what might be happening here, and I'm pretty sure it's related to the IRQs being disabled during the rq->lock walk.  I think that the main fast thread was able to use an excess amount
> of quota because the timer interrupt meant to stop it wasn't being handled timely due to the interrupts being disabled.  On my 8 core machine this resulted in a what looked like simply improved usage of the quota, but when I ran the test on an 80 core machine I
> saw a massive overage of cpu usage when running fibtest.  Specifically when running fibtest for 5 seconds with 50ms quota/100ms period expecting ~2500ms of quota usage; I got 3731 ms of cpu usage which was an unexpected overage of 1231ms. Is that a
> reasonable theory?

I think I've figured out what's going on here (and a related issue
that gave me some inconsistency when trying to debug it): other "slow"
threads can wake up while the slack timer is in distribute and
double-spend some runtime. Since we lsub_positive rather than allow
cfs_b->runtime to be negative this double-spending is permanent, and can
go on indefinitely.
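
A rough sketch of that interleaving (illustrative numbers):

	1. slack timer: lock cfs_b; snapshot runtime = 10ms; unlock
	2. slack timer: distribute_cfs_runtime() hands that 10ms out to
	   throttled cfs_rqs (cfs_b->lock is not held here)
	3. meanwhile a waking "slow" thread on another cpu runs
	   assign_cfs_rq_runtime(): lock cfs_b; pull a 5ms slice out of
	   the pool (10ms -> 5ms); unlock
	4. slack timer: lock cfs_b; lsub_positive(&cfs_b->runtime, 10ms)
	   clamps 5ms - 10ms to 0; unlock

Net: ~15ms handed out against the 10ms that was actually available, and
because the pool is clamped at zero the extra is never paid back.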

In addition, if things fall out in a slightly different way, all the
"slow" threads can wind up getting on cpu and claiming slices of runtime
before the "fast" thread, and then it just has to wait another slack
period to hope that the ordering winds up better that time. This just
depends on things like IPI latency and maybe what order things happened
to happen at the start of the period.

Ugh. Maybe we /do/ just give up and say that most people don't seem to
be using cfs_b in a way that expiration of the leftover 1ms matters.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-07-12 22:09             ` bsegall
@ 2019-07-15 15:44               ` Dave Chiluk
  0 siblings, 0 replies; 40+ messages in thread
From: Dave Chiluk @ 2019-07-15 15:44 UTC (permalink / raw)
  To: Ben Segall
  Cc: Peter Zijlstra, Phil Auld, Peter Oskolkov, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc, Paul Turner

On Fri, Jul 12, 2019 at 5:10 PM <bsegall@google.com> wrote:
> Ugh. Maybe we /do/ just give up and say that most people don't seem to
> be using cfs_b in a way that expiration of the leftover 1ms matters.

That was my conclusion as well.  Does this mean you want to proceed
with my patch set?  Do you have any changes you want made to the
proposed documentation changes, or any other changes for that matter?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-06-27 19:49   ` [PATCH v5 1/1] " Dave Chiluk
  2019-07-01 20:15     ` bsegall
@ 2019-07-16 19:58     ` bsegall
  1 sibling, 0 replies; 40+ messages in thread
From: bsegall @ 2019-07-16 19:58 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc

Dave Chiluk <chiluk+linux@indeed.com> writes:

> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. At which
> point the slice and quota expires. This expiration of unused slice
> results in applications not being able to utilize the quota for which
> they are allocated.
>
> The expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'. Prior to that it appears that this has been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> added the following conditional which resulted in slices never being
> expired.
>
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> 	/* extend local deadline, drift is bounded above by 2 ticks */
> 	cfs_rq->runtime_expires += TICK_NSEC;
>
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
>
> This allows quota already allocated to per-cpu run-queues to live longer
> than the period boundary. This allows threads on runqueues that do not
> use much CPU to continue to use their remaining slice over a longer
> period of time than cpu.cfs_period_us. However, this helps prevent the
> above condition of hitting throttling while also not fully utilizing
> your cpu quota.
>
> This theoretically allows a machine to use slightly more than its
> allotted quota in some periods. This overflow would be bounded by the
> remaining quota left on each per-cpu runqueue. This is typically no
> more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> change nothing, as they should theoretically fully utilize all of their
> quota in each period. For user-interactive tasks as described above this
> provides a much better user/application experience as their cpu
> utilization will more closely match the amount they requested when they
> hit throttling. This means that cpu limits no longer strictly apply per
> period for non-cpu bound applications, but that they are still accurate
> over longer timeframes.
>
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase (10ms/100ms of quota on
> 80 CPU machine), this commit resulted in almost 30x performance
> improvement, while still maintaining correct cpu quota restrictions.
> That testcase is available at https://github.com/indeedeng/fibtest.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> ---
>  Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++-------
>  kernel/sched/fair.c                   | 71 +++-------------------------------
>  kernel/sched/sched.h                  |  4 --
>  3 files changed, 65 insertions(+), 83 deletions(-)
>
> [...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..a675c69 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (!runtime)
>  		return;
>  
> -	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> +	runtime = distribute_cfs_runtime(cfs_b, runtime);
>  
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> -	if (expires == cfs_b->runtime_expires)
> -		lsub_positive(&cfs_b->runtime, runtime);

The lsub_positive is still needed, just get rid of the if.
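
i.e. the tail of do_sched_cfs_slack_timer() would then read roughly like
this (sketch; the v6 changelog below notes the call being added back):

	runtime = distribute_cfs_runtime(cfs_b, runtime);

	raw_spin_lock_irqsave(&cfs_b->lock, flags);
	lsub_positive(&cfs_b->runtime, runtime);
	cfs_b->distribute_running = 0;
	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);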


Other than that,
Reviewed-by: Ben Segall <bsegall@google.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v6 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-17 19:30 [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices Dave Chiluk
                   ` (4 preceding siblings ...)
  2019-06-27 19:49 ` [PATCH v5 0/1] " Dave Chiluk
@ 2019-07-23 16:44 ` Dave Chiluk
  2019-07-23 16:44   ` [PATCH v6 1/1] " Dave Chiluk
  5 siblings, 1 reply; 40+ messages in thread
From: Dave Chiluk @ 2019-07-23 16:44 UTC (permalink / raw)
  To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc

Changelog v6
- Added back missing call to lsub_positive(&cfs_b->runtime, runtime);
- Added Reviewed-by: Ben Segall <bsegall@google.com>
- Fix some grammar in the Documentation, and change some wording.
- Updated documentation due to the .rst change

Changelog v5
- Based on the following comment from Ben Segall on v4
> If the cost of taking this global lock across all cpus without a
> ratelimit was somehow not a problem, I'd much prefer to just set
> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
> and sorta have 2x period 2x runtime" solution of removing expiration)
I'm resubmitting my v3 patchset, with the requested changes.
- Updated Commit log given review comments
- Update sched-bwc.txt given my new understanding of the slack timer; a rough
sketch of that path follows for context.
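
For context, the slack path in question works roughly like this (a
from-memory sketch of __return_cfs_rq_runtime(), not part of this patch;
details may differ slightly):

	/* when a cfs_rq dequeues its last task, keep only min_cfs_rq_runtime */
	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;

	if (slack_runtime <= 0)
		return;

	raw_spin_lock(&cfs_b->lock);
	if (cfs_b->quota != RUNTIME_INF) {
		cfs_b->runtime += slack_runtime;
		/* unthrottling is deferred to the ~5ms slack timer */
	}
	raw_spin_unlock(&cfs_b->lock);

	cfs_rq->runtime_remaining -= slack_runtime;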

Changelog v4
- Rewrote patchset around the concept of returning all of runtime_remaining
when cfs_b nears the end of available quota.

Changelog v3
- Reworked documentation to better describe behavior of slice expiration per
feedback from Peter Oskolkov

Changelog v2
- Fixed some checkpatch errors in the commit message.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-07-23 16:44 ` [PATCH v6 0/1] " Dave Chiluk
@ 2019-07-23 16:44   ` Dave Chiluk
  2019-07-23 17:13     ` Phil Auld
  2019-08-08 10:53     ` [tip:sched/core] " tip-bot for Dave Chiluk
  0 siblings, 2 replies; 40+ messages in thread
From: Dave Chiluk @ 2019-07-23 16:44 UTC (permalink / raw)
  To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc

It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to cpu-local run queues being allocated per-cpu
bandwidth slices, and then not fully using those slices within the period,
at which point the slice and quota expire. This expiration of unused
slice results in applications not being able to utilize the quota for
which they are allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.
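
As an illustrative (not measured) example: a task group with runnable
threads on 8 cpus that keeps leaving part of its slice unused could carry
at most roughly 8 * 1ms = 8ms of extra runtime across a period boundary,
regardless of how large cfs_quota_us is.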

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 Documentation/scheduler/sched-bwc.rst | 74 ++++++++++++++++++++++++++++-------
 kernel/sched/fair.c                   | 72 ++++------------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 67 insertions(+), 83 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
index 3a90642..9801d6b 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.
 
 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary.  As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis.  The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per-cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
 Management
@@ -35,12 +36,12 @@ The default values are::
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group.  This represents the traditional work-conserving behavior for
+bandwidth group. This represents the traditional work-conserving behavior for
 CFS.
 
 Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms.  There is also an
-upper bound on the period length of 1s.  Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
 bandwidth limits are used in a hierarchical fashion, these are explained in
 more detail below.
 
@@ -53,8 +54,8 @@ unthrottled if it is in a constrained state.
 System wide settings
 --------------------
 For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion.  This greatly reduces global accounting pressure
-on large systems.  The amount transferred each time such an update is required
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
 is described as the "slice".
 
 This is tunable via procfs::
@@ -97,6 +98,51 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
 
+CFS Bandwidth Quota Caveats
+---------------------------
+Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
+the slice may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable. This is a performance tweak that helps prevent added contention on
+the global lock.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equals nr_throttled, and that
+cpuacct.usage will increase by roughly cfs_quota_us in each period.
+
+For highly-threaded, non-cpu bound applications this non-expiration nuance
+allows applications to briefly burst past their quota limits by the amount of
+unused slice on each cpu that the task group is running on (typically at most
+1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
+applies if quota had been assigned to a cpu and then not fully used or returned
+in previous periods. This burst amount will not be transferred between cores.
+As a result, this mechanism still strictly limits the task group to quota
+average usage, albeit over a longer time window than a single period.  This
+also limits the burst ability to no more than 1ms per cpu.  This provides a
+better, more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultaneously using less than
+quota amounts of cpu. Another way to say this is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wastefully expiring quota on cpu-local silos that don't need a
+full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
 Examples
 --------
 1. Limit a group to 1 CPU worth of runtime::
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95..00b68f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4316,8 +4316,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4339,8 +4337,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4357,61 +4354,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4602,8 +4555,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4625,7 +4577,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4650,7 +4601,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4678,8 +4629,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4692,8 +4641,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
@@ -4775,8 +4723,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4809,7 +4756,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 	unsigned long flags;
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4827,7 +4773,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4836,11 +4781,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	if (expires == cfs_b->runtime_expires)
-		lsub_positive(&cfs_b->runtime, runtime);
+	lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
@@ -4997,8 +4941,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3..28c16e9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -335,8 +335,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;
 
 	u8			idle;
 	u8			period_active;
@@ -556,8 +554,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;
 
 	u64			throttled_clock;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-07-23 16:44   ` [PATCH v6 1/1] " Dave Chiluk
@ 2019-07-23 17:13     ` Phil Auld
  2019-07-23 22:12       ` Dave Chiluk
  2019-07-26 18:14       ` Peter Zijlstra
  2019-08-08 10:53     ` [tip:sched/core] " tip-bot for Dave Chiluk
  1 sibling, 2 replies; 40+ messages in thread
From: Phil Auld @ 2019-07-23 17:13 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Ben Segall, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc

Hi Dave,

On Tue, Jul 23, 2019 at 11:44:26AM -0500 Dave Chiluk wrote:
> It has been observed that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
> 
> This has been root caused to cpu-local run queues being allocated per-cpu
> bandwidth slices, and then not fully using those slices within the period,
> at which point the slice and quota expire. This expiration of unused
> slice results in applications not being able to utilize the quota for
> which they are allocated.
> 
> The non-expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'. Prior to that it appears that this had been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> added the following conditional which resulted in slices never being
> expired.
> 
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> 	/* extend local deadline, drift is bounded above by 2 ticks */
> 	cfs_rq->runtime_expires += TICK_NSEC;
> 
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
> 
> This allows quota already allocated to per-cpu run-queues to live longer
> than the period boundary. This allows threads on runqueues that do not
> use much CPU to continue to use their remaining slice over a longer
> period of time than cpu.cfs_period_us. However, this helps prevent the
> above condition of hitting throttling while also not fully utilizing
> your cpu quota.
> 
> This theoretically allows a machine to use slightly more than its
> allotted quota in some periods. This overflow would be bounded by the
> remaining quota left on each per-cpu runqueue. This is typically no
> more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> change nothing, as they should theoretically fully utilize all of their
> quota in each period. For user-interactive tasks as described above this
> provides a much better user/application experience as their cpu
> utilization will more closely match the amount they requested when they
> hit throttling. This means that cpu limits no longer strictly apply per
> period for non-cpu bound applications, but that they are still accurate
> over longer timeframes.
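> 
> As an illustrative (not measured) example: a task group with runnable
> threads on 8 cpus that keeps leaving part of its slice unused could carry
> at most roughly 8 * 1ms = 8ms of extra runtime across a period boundary,
> regardless of how large cfs_quota_us is.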
> 
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase (10ms/100ms of quota on
> 80 CPU machine), this commit resulted in almost 30x performance
> improvement, while still maintaining correct cpu quota restrictions.
> That testcase is available at https://github.com/indeedeng/fibtest.
> 
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> Reviewed-by: Ben Segall <bsegall@google.com>

This still works for me. The documentation reads pretty well, too. Good job.

Feel free to add my Acked-by: or Reviewed-by: Phil Auld <pauld@redhat.com>.

I'll run it through some more tests when I have time. The code is the same
as the earlier one I tested from what I can see.

Cheers,
Phil

> ---
>  Documentation/scheduler/sched-bwc.rst | 74 ++++++++++++++++++++++++++++-------
>  kernel/sched/fair.c                   | 72 ++++------------------------------
>  kernel/sched/sched.h                  |  4 --
>  3 files changed, 67 insertions(+), 83 deletions(-)
> 
> diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
> index 3a90642..9801d6b 100644
> --- a/Documentation/scheduler/sched-bwc.rst
> +++ b/Documentation/scheduler/sched-bwc.rst
> @@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
>  specification of the maximum CPU bandwidth available to a group or hierarchy.
>  
>  The bandwidth allowed for a group is specified using a quota and period. Within
> -each given "period" (microseconds), a group is allowed to consume only up to
> -"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
> -group exceeds this limit (for that period), the tasks belonging to its
> -hierarchy will be throttled and are not allowed to run again until the next
> -period.
> -
> -A group's unused runtime is globally tracked, being refreshed with quota units
> -above at each period boundary.  As threads consume this bandwidth it is
> -transferred to cpu-local "silos" on a demand basis.  The amount transferred
> +each given "period" (microseconds), a task group is allocated up to "quota"
> +microseconds of CPU time. That quota is assigned to per-cpu run queues in
> +slices as threads in the cgroup become runnable. Once all quota has been
> +assigned any additional requests for quota will result in those threads being
> +throttled. Throttled threads will not be able to run again until the next
> +period when the quota is replenished.
> +
> +A group's unassigned quota is globally tracked, being refreshed back to
> +cfs_quota units at each period boundary. As threads consume this bandwidth it
> +is transferred to cpu-local "silos" on a demand basis. The amount transferred
>  within each of these updates is tunable and described as the "slice".
>  
>  Management
> @@ -35,12 +36,12 @@ The default values are::
>  
>  A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
>  bandwidth restriction in place, such a group is described as an unconstrained
> -bandwidth group.  This represents the traditional work-conserving behavior for
> +bandwidth group. This represents the traditional work-conserving behavior for
>  CFS.
>  
>  Writing any (valid) positive value(s) will enact the specified bandwidth limit.
> -The minimum quota allowed for the quota or period is 1ms.  There is also an
> -upper bound on the period length of 1s.  Additional restrictions exist when
> +The minimum quota allowed for the quota or period is 1ms. There is also an
> +upper bound on the period length of 1s. Additional restrictions exist when
>  bandwidth limits are used in a hierarchical fashion, these are explained in
>  more detail below.
>  
> @@ -53,8 +54,8 @@ unthrottled if it is in a constrained state.
>  System wide settings
>  --------------------
>  For efficiency run-time is transferred between the global pool and CPU local
> -"silos" in a batch fashion.  This greatly reduces global accounting pressure
> -on large systems.  The amount transferred each time such an update is required
> +"silos" in a batch fashion. This greatly reduces global accounting pressure
> +on large systems. The amount transferred each time such an update is required
>  is described as the "slice".
>  
>  This is tunable via procfs::
> @@ -97,6 +98,51 @@ There are two ways in which a group may become throttled:
>  In case b) above, even though the child may have runtime remaining it will not
>  be allowed to until the parent's runtime is refreshed.
>  
> +CFS Bandwidth Quota Caveats
> +---------------------------
> +Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
> +the slice may be returned to the global pool if all threads on that cpu become
> +unrunnable. This is configured at compile time by the min_cfs_rq_runtime
> +variable. This is a performance tweak that helps prevent added contention on
> +the global lock.
> +
> +The fact that cpu-local slices do not expire results in some interesting corner
> +cases that should be understood.
> +
> +For cgroup cpu constrained applications that are cpu limited this is a
> +relatively moot point because they will naturally consume the entirety of their
> +quota as well as the entirety of each cpu-local slice in each period. As a
> +result it is expected that nr_periods roughly equals nr_throttled, and that
> +cpuacct.usage will increase by roughly cfs_quota_us in each period.
> +
> +For highly-threaded, non-cpu bound applications this non-expiration nuance
> +allows applications to briefly burst past their quota limits by the amount of
> +unused slice on each cpu that the task group is running on (typically at most
> +1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
> +applies if quota had been assigned to a cpu and then not fully used or returned
> +in previous periods. This burst amount will not be transferred between cores.
> +As a result, this mechanism still strictly limits the task group to quota
> +average usage, albeit over a longer time window than a single period.  This
> +also limits the burst ability to no more than 1ms per cpu.  This provides a
> +better, more predictable user experience for highly threaded applications with
> +small quota limits on high core count machines. It also eliminates the
> +propensity to throttle these applications while simultaneously using less than
> +quota amounts of cpu. Another way to say this is that by allowing the unused
> +portion of a slice to remain valid across periods we have decreased the
> +possibility of wastefully expiring quota on cpu-local silos that don't need a
> +full slice's amount of cpu time.
> +
> +The interaction between cpu-bound and non-cpu-bound-interactive applications
> +should also be considered, especially when single core usage hits 100%. If you
> +gave each of these applications half of a cpu-core and they both got scheduled
> +on the same CPU it is theoretically possible that the non-cpu bound application
> +will use up to 1ms additional quota in some periods, thereby preventing the
> +cpu-bound application from fully using its quota by that same amount. In these
> +instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
> +decide which application is chosen to run, as they will both be runnable and
> +have remaining quota. This runtime discrepancy will be made up in the following
> +periods when the interactive application idles.
> +
>  Examples
>  --------
>  1. Limit a group to 1 CPU worth of runtime::
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 036be95..00b68f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4316,8 +4316,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>  
>  	now = sched_clock_cpu(smp_processor_id());
>  	cfs_b->runtime = cfs_b->quota;
> -	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> -	cfs_b->expires_seq++;
>  }
>  
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4339,8 +4337,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  {
>  	struct task_group *tg = cfs_rq->tg;
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> -	u64 amount = 0, min_amount, expires;
> -	int expires_seq;
> +	u64 amount = 0, min_amount;
>  
>  	/* note: this is a positive sum as runtime_remaining <= 0 */
>  	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> @@ -4357,61 +4354,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  			cfs_b->idle = 0;
>  		}
>  	}
> -	expires_seq = cfs_b->expires_seq;
> -	expires = cfs_b->runtime_expires;
>  	raw_spin_unlock(&cfs_b->lock);
>  
>  	cfs_rq->runtime_remaining += amount;
> -	/*
> -	 * we may have advanced our local expiration to account for allowed
> -	 * spread between our sched_clock and the one on which runtime was
> -	 * issued.
> -	 */
> -	if (cfs_rq->expires_seq != expires_seq) {
> -		cfs_rq->expires_seq = expires_seq;
> -		cfs_rq->runtime_expires = expires;
> -	}
>  
>  	return cfs_rq->runtime_remaining > 0;
>  }
>  
> -/*
> - * Note: This depends on the synchronization provided by sched_clock and the
> - * fact that rq->clock snapshots this value.
> - */
> -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> -{
> -	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -
> -	/* if the deadline is ahead of our clock, nothing to do */
> -	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> -		return;
> -
> -	if (cfs_rq->runtime_remaining < 0)
> -		return;
> -
> -	/*
> -	 * If the local deadline has passed we have to consider the
> -	 * possibility that our sched_clock is 'fast' and the global deadline
> -	 * has not truly expired.
> -	 *
> -	 * Fortunately we can check determine whether this the case by checking
> -	 * whether the global deadline(cfs_b->expires_seq) has advanced.
> -	 */
> -	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> -		/* extend local deadline, drift is bounded above by 2 ticks */
> -		cfs_rq->runtime_expires += TICK_NSEC;
> -	} else {
> -		/* global deadline is ahead, expiration has passed */
> -		cfs_rq->runtime_remaining = 0;
> -	}
> -}
> -
>  static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>  {
>  	/* dock delta_exec before expiring quota (as it could span periods) */
>  	cfs_rq->runtime_remaining -= delta_exec;
> -	expire_cfs_rq_runtime(cfs_rq);
>  
>  	if (likely(cfs_rq->runtime_remaining > 0))
>  		return;
> @@ -4602,8 +4555,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  		resched_curr(rq);
>  }
>  
> -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> -		u64 remaining, u64 expires)
> +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>  {
>  	struct cfs_rq *cfs_rq;
>  	u64 runtime;
> @@ -4625,7 +4577,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>  		remaining -= runtime;
>  
>  		cfs_rq->runtime_remaining += runtime;
> -		cfs_rq->runtime_expires = expires;
>  
>  		/* we check whether we're throttled above */
>  		if (cfs_rq->runtime_remaining > 0)
> @@ -4650,7 +4601,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>   */
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
>  {
> -	u64 runtime, runtime_expires;
> +	u64 runtime;
>  	int throttled;
>  
>  	/* no need to continue the timer with no bandwidth constraint */
> @@ -4678,8 +4629,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>  	/* account preceding periods in which throttling occurred */
>  	cfs_b->nr_throttled += overrun;
>  
> -	runtime_expires = cfs_b->runtime_expires;
> -
>  	/*
>  	 * This check is repeated as we are holding onto the new bandwidth while
>  	 * we unthrottle. This can potentially race with an unthrottled group
> @@ -4692,8 +4641,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>  		cfs_b->distribute_running = 1;
>  		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  		/* we can't nest cfs_b->lock while distributing bandwidth */
> -		runtime = distribute_cfs_runtime(cfs_b, runtime,
> -						 runtime_expires);
> +		runtime = distribute_cfs_runtime(cfs_b, runtime);
>  		raw_spin_lock_irqsave(&cfs_b->lock, flags);
>  
>  		cfs_b->distribute_running = 0;
> @@ -4775,8 +4723,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  		return;
>  
>  	raw_spin_lock(&cfs_b->lock);
> -	if (cfs_b->quota != RUNTIME_INF &&
> -	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> +	if (cfs_b->quota != RUNTIME_INF) {
>  		cfs_b->runtime += slack_runtime;
>  
>  		/* we are under rq->lock, defer unthrottling using a timer */
> @@ -4809,7 +4756,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  {
>  	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
>  	unsigned long flags;
> -	u64 expires;
>  
>  	/* confirm we're still not at a refresh boundary */
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> @@ -4827,7 +4773,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
>  		runtime = cfs_b->runtime;
>  
> -	expires = cfs_b->runtime_expires;
>  	if (runtime)
>  		cfs_b->distribute_running = 1;
>  
> @@ -4836,11 +4781,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (!runtime)
>  		return;
>  
> -	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> +	runtime = distribute_cfs_runtime(cfs_b, runtime);
>  
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> -	if (expires == cfs_b->runtime_expires)
> -		lsub_positive(&cfs_b->runtime, runtime);
> +	lsub_positive(&cfs_b->runtime, runtime);
>  	cfs_b->distribute_running = 0;
>  	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  }
> @@ -4997,8 +4941,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  
>  	cfs_b->period_active = 1;
>  	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> -	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> -	cfs_b->expires_seq++;
>  	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>  }
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 802b1f3..28c16e9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -335,8 +335,6 @@ struct cfs_bandwidth {
>  	u64			quota;
>  	u64			runtime;
>  	s64			hierarchical_quota;
> -	u64			runtime_expires;
> -	int			expires_seq;
>  
>  	u8			idle;
>  	u8			period_active;
> @@ -556,8 +554,6 @@ struct cfs_rq {
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
>  	int			runtime_enabled;
> -	int			expires_seq;
> -	u64			runtime_expires;
>  	s64			runtime_remaining;
>  
>  	u64			throttled_clock;
> -- 
> 1.8.3.1
> 

-- 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-07-23 17:13     ` Phil Auld
@ 2019-07-23 22:12       ` Dave Chiluk
  2019-07-23 23:26         ` Phil Auld
  2019-07-26 18:14       ` Peter Zijlstra
  1 sibling, 1 reply; 40+ messages in thread
From: Dave Chiluk @ 2019-07-23 22:12 UTC (permalink / raw)
  To: Phil Auld
  Cc: Ben Segall, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc

Thanks for all the help and testing you provided.  It's good to know
these changes have passed at least some scheduler regression tests.
If it comes to a v7 I'll add the Reviewed-by, otherwise I'll just let
Peter add it.

Will you be handling the backport into the RHEL 8 kernels?  I'll
submit this to Ubuntu and linux-stable once it gets accepted.

Thanks again,


On Tue, Jul 23, 2019 at 12:13 PM Phil Auld <pauld@redhat.com> wrote:
>
> Hi Dave,
>
> On Tue, Jul 23, 2019 at 11:44:26AM -0500 Dave Chiluk wrote:
> > It has been observed that highly-threaded, non-cpu-bound applications
> > running under cpu.cfs_quota_us constraints can hit a high percentage of
> > periods throttled while simultaneously not consuming the allocated
> > amount of quota. This use case is typical of user-interactive non-cpu
> > bound applications, such as those running in kubernetes or mesos when
> > run on multiple cpu cores.
> >
> > This has been root caused to cpu-local run queues being allocated per-cpu
> > bandwidth slices, and then not fully using those slices within the period,
> > at which point the slice and quota expire. This expiration of unused
> > slice results in applications not being able to utilize the quota for
> > which they are allocated.
> >
> > The non-expiration of per-cpu slices was recently fixed by
> > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> > condition")'. Prior to that it appears that this had been broken since
> > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> > added the following conditional which resulted in slices never being
> > expired.
> >
> > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> >       /* extend local deadline, drift is bounded above by 2 ticks */
> >       cfs_rq->runtime_expires += TICK_NSEC;
> >
> > Because this was broken for nearly 5 years, and has recently been fixed
> > and is now being noticed by many users running kubernetes
> > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> > that the mechanisms around expiring runtime should be removed
> > altogether.
> >
> > This allows quota already allocated to per-cpu run-queues to live longer
> > than the period boundary. This allows threads on runqueues that do not
> > use much CPU to continue to use their remaining slice over a longer
> > period of time than cpu.cfs_period_us. However, this helps prevent the
> > above condition of hitting throttling while also not fully utilizing
> > your cpu quota.
> >
> > This theoretically allows a machine to use slightly more than its
> > allotted quota in some periods. This overflow would be bounded by the
> > remaining quota left on each per-cpu runqueue. This is typically no
> > more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> > change nothing, as they should theoretically fully utilize all of their
> > quota in each period. For user-interactive tasks as described above this
> > provides a much better user/application experience as their cpu
> > utilization will more closely match the amount they requested when they
> > hit throttling. This means that cpu limits no longer strictly apply per
> > period for non-cpu bound applications, but that they are still accurate
> > over longer timeframes.
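> >
> > As an illustrative (not measured) example: a task group with runnable
> > threads on 8 cpus that keeps leaving part of its slice unused could carry
> > at most roughly 8 * 1ms = 8ms of extra runtime across a period boundary,
> > regardless of how large cfs_quota_us is.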
> >
> > This greatly improves performance of high-thread-count, non-cpu bound
> > applications with low cfs_quota_us allocation on high-core-count
> > machines. In the case of an artificial testcase (10ms/100ms of quota on
> > 80 CPU machine), this commit resulted in almost 30x performance
> > improvement, while still maintaining correct cpu quota restrictions.
> > That testcase is available at https://github.com/indeedeng/fibtest.
> >
> > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> > Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> > Reviewed-by: Ben Segall <bsegall@google.com>
>
> This still works for me. The documentation reads pretty well, too. Good job.
>
> Feel free to add my Acked-by: or Reviewed-by: Phil Auld <pauld@redhat.com>.
>
> I'll run it through some more tests when I have time. The code is the same
> as the earlier one I tested from what I can see.
>
> Cheers,
> Phil
>
> > ---
> >  Documentation/scheduler/sched-bwc.rst | 74 ++++++++++++++++++++++++++++-------
> >  kernel/sched/fair.c                   | 72 ++++------------------------------
> >  kernel/sched/sched.h                  |  4 --
> >  3 files changed, 67 insertions(+), 83 deletions(-)
> >
> > diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
> > index 3a90642..9801d6b 100644
> > --- a/Documentation/scheduler/sched-bwc.rst
> > +++ b/Documentation/scheduler/sched-bwc.rst
> > @@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
> >  specification of the maximum CPU bandwidth available to a group or hierarchy.
> >
> >  The bandwidth allowed for a group is specified using a quota and period. Within
> > -each given "period" (microseconds), a group is allowed to consume only up to
> > -"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
> > -group exceeds this limit (for that period), the tasks belonging to its
> > -hierarchy will be throttled and are not allowed to run again until the next
> > -period.
> > -
> > -A group's unused runtime is globally tracked, being refreshed with quota units
> > -above at each period boundary.  As threads consume this bandwidth it is
> > -transferred to cpu-local "silos" on a demand basis.  The amount transferred
> > +each given "period" (microseconds), a task group is allocated up to "quota"
> > +microseconds of CPU time. That quota is assigned to per-cpu run queues in
> > +slices as threads in the cgroup become runnable. Once all quota has been
> > +assigned any additional requests for quota will result in those threads being
> > +throttled. Throttled threads will not be able to run again until the next
> > +period when the quota is replenished.
> > +
> > +A group's unassigned quota is globally tracked, being refreshed back to
> > +cfs_quota units at each period boundary. As threads consume this bandwidth it
> > +is transferred to cpu-local "silos" on a demand basis. The amount transferred
> >  within each of these updates is tunable and described as the "slice".
> >
> >  Management
> > @@ -35,12 +36,12 @@ The default values are::
> >
> >  A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
> >  bandwidth restriction in place, such a group is described as an unconstrained
> > -bandwidth group.  This represents the traditional work-conserving behavior for
> > +bandwidth group. This represents the traditional work-conserving behavior for
> >  CFS.
> >
> >  Writing any (valid) positive value(s) will enact the specified bandwidth limit.
> > -The minimum quota allowed for the quota or period is 1ms.  There is also an
> > -upper bound on the period length of 1s.  Additional restrictions exist when
> > +The minimum quota allowed for the quota or period is 1ms. There is also an
> > +upper bound on the period length of 1s. Additional restrictions exist when
> >  bandwidth limits are used in a hierarchical fashion, these are explained in
> >  more detail below.
> >
> > @@ -53,8 +54,8 @@ unthrottled if it is in a constrained state.
> >  System wide settings
> >  --------------------
> >  For efficiency run-time is transferred between the global pool and CPU local
> > -"silos" in a batch fashion.  This greatly reduces global accounting pressure
> > -on large systems.  The amount transferred each time such an update is required
> > +"silos" in a batch fashion. This greatly reduces global accounting pressure
> > +on large systems. The amount transferred each time such an update is required
> >  is described as the "slice".
> >
> >  This is tunable via procfs::
> > @@ -97,6 +98,51 @@ There are two ways in which a group may become throttled:
> >  In case b) above, even though the child may have runtime remaining it will not
> >  be allowed to until the parent's runtime is refreshed.
> >
> > +CFS Bandwidth Quota Caveats
> > +---------------------------
> > +Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
> > +the slice may be returned to the global pool if all threads on that cpu become
> > +unrunnable. This is configured at compile time by the min_cfs_rq_runtime
> > +variable. This is a performance tweak that helps prevent added contention on
> > +the global lock.
> > +
> > +The fact that cpu-local slices do not expire results in some interesting corner
> > +cases that should be understood.
> > +
> > +For cgroup cpu constrained applications that are cpu limited this is a
> > +relatively moot point because they will naturally consume the entirety of their
> > +quota as well as the entirety of each cpu-local slice in each period. As a
> > +result it is expected that nr_periods roughly equals nr_throttled, and that
> > +cpuacct.usage will increase by roughly cfs_quota_us in each period.
> > +
> > +For highly-threaded, non-cpu bound applications this non-expiration nuance
> > +allows applications to briefly burst past their quota limits by the amount of
> > +unused slice on each cpu that the task group is running on (typically at most
> > +1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
> > +applies if quota had been assigned to a cpu and then not fully used or returned
> > +in previous periods. This burst amount will not be transferred between cores.
> > +As a result, this mechanism still strictly limits the task group to quota
> > +average usage, albeit over a longer time window than a single period.  This
> > +also limits the burst ability to no more than 1ms per cpu.  This provides a
> > +better, more predictable user experience for highly threaded applications with
> > +small quota limits on high core count machines. It also eliminates the
> > +propensity to throttle these applications while simultaneously using less than
> > +quota amounts of cpu. Another way to say this is that by allowing the unused
> > +portion of a slice to remain valid across periods we have decreased the
> > +possibility of wastefully expiring quota on cpu-local silos that don't need a
> > +full slice's amount of cpu time.
> > +
> > +The interaction between cpu-bound and non-cpu-bound-interactive applications
> > +should also be considered, especially when single core usage hits 100%. If you
> > +gave each of these applications half of a cpu-core and they both got scheduled
> > +on the same CPU it is theoretically possible that the non-cpu bound application
> > +will use up to 1ms additional quota in some periods, thereby preventing the
> > +cpu-bound application from fully using its quota by that same amount. In these
> > +instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
> > +decide which application is chosen to run, as they will both be runnable and
> > +have remaining quota. This runtime discrepancy will be made up in the following
> > +periods when the interactive application idles.
> > +
> >  Examples
> >  --------
> >  1. Limit a group to 1 CPU worth of runtime::
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 036be95..00b68f0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4316,8 +4316,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> >
> >       now = sched_clock_cpu(smp_processor_id());
> >       cfs_b->runtime = cfs_b->quota;
> > -     cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> > -     cfs_b->expires_seq++;
> >  }
> >
> >  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> > @@ -4339,8 +4337,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >  {
> >       struct task_group *tg = cfs_rq->tg;
> >       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> > -     u64 amount = 0, min_amount, expires;
> > -     int expires_seq;
> > +     u64 amount = 0, min_amount;
> >
> >       /* note: this is a positive sum as runtime_remaining <= 0 */
> >       min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> > @@ -4357,61 +4354,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >                       cfs_b->idle = 0;
> >               }
> >       }
> > -     expires_seq = cfs_b->expires_seq;
> > -     expires = cfs_b->runtime_expires;
> >       raw_spin_unlock(&cfs_b->lock);
> >
> >       cfs_rq->runtime_remaining += amount;
> > -     /*
> > -      * we may have advanced our local expiration to account for allowed
> > -      * spread between our sched_clock and the one on which runtime was
> > -      * issued.
> > -      */
> > -     if (cfs_rq->expires_seq != expires_seq) {
> > -             cfs_rq->expires_seq = expires_seq;
> > -             cfs_rq->runtime_expires = expires;
> > -     }
> >
> >       return cfs_rq->runtime_remaining > 0;
> >  }
> >
> > -/*
> > - * Note: This depends on the synchronization provided by sched_clock and the
> > - * fact that rq->clock snapshots this value.
> > - */
> > -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > -{
> > -     struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> > -
> > -     /* if the deadline is ahead of our clock, nothing to do */
> > -     if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> > -             return;
> > -
> > -     if (cfs_rq->runtime_remaining < 0)
> > -             return;
> > -
> > -     /*
> > -      * If the local deadline has passed we have to consider the
> > -      * possibility that our sched_clock is 'fast' and the global deadline
> > -      * has not truly expired.
> > -      *
> > -      * Fortunately we can check determine whether this the case by checking
> > -      * whether the global deadline(cfs_b->expires_seq) has advanced.
> > -      */
> > -     if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> > -             /* extend local deadline, drift is bounded above by 2 ticks */
> > -             cfs_rq->runtime_expires += TICK_NSEC;
> > -     } else {
> > -             /* global deadline is ahead, expiration has passed */
> > -             cfs_rq->runtime_remaining = 0;
> > -     }
> > -}
> > -
> >  static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> >  {
> >       /* dock delta_exec before expiring quota (as it could span periods) */
> >       cfs_rq->runtime_remaining -= delta_exec;
> > -     expire_cfs_rq_runtime(cfs_rq);
> >
> >       if (likely(cfs_rq->runtime_remaining > 0))
> >               return;
> > @@ -4602,8 +4555,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> >               resched_curr(rq);
> >  }
> >
> > -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> > -             u64 remaining, u64 expires)
> > +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
> >  {
> >       struct cfs_rq *cfs_rq;
> >       u64 runtime;
> > @@ -4625,7 +4577,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> >               remaining -= runtime;
> >
> >               cfs_rq->runtime_remaining += runtime;
> > -             cfs_rq->runtime_expires = expires;
> >
> >               /* we check whether we're throttled above */
> >               if (cfs_rq->runtime_remaining > 0)
> > @@ -4650,7 +4601,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> >   */
> >  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
> >  {
> > -     u64 runtime, runtime_expires;
> > +     u64 runtime;
> >       int throttled;
> >
> >       /* no need to continue the timer with no bandwidth constraint */
> > @@ -4678,8 +4629,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> >       /* account preceding periods in which throttling occurred */
> >       cfs_b->nr_throttled += overrun;
> >
> > -     runtime_expires = cfs_b->runtime_expires;
> > -
> >       /*
> >        * This check is repeated as we are holding onto the new bandwidth while
> >        * we unthrottle. This can potentially race with an unthrottled group
> > @@ -4692,8 +4641,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> >               cfs_b->distribute_running = 1;
> >               raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> >               /* we can't nest cfs_b->lock while distributing bandwidth */
> > -             runtime = distribute_cfs_runtime(cfs_b, runtime,
> > -                                              runtime_expires);
> > +             runtime = distribute_cfs_runtime(cfs_b, runtime);
> >               raw_spin_lock_irqsave(&cfs_b->lock, flags);
> >
> >               cfs_b->distribute_running = 0;
> > @@ -4775,8 +4723,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >               return;
> >
> >       raw_spin_lock(&cfs_b->lock);
> > -     if (cfs_b->quota != RUNTIME_INF &&
> > -         cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> > +     if (cfs_b->quota != RUNTIME_INF) {
> >               cfs_b->runtime += slack_runtime;
> >
> >               /* we are under rq->lock, defer unthrottling using a timer */
> > @@ -4809,7 +4756,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >  {
> >       u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> >       unsigned long flags;
> > -     u64 expires;
> >
> >       /* confirm we're still not at a refresh boundary */
> >       raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > @@ -4827,7 +4773,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >       if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
> >               runtime = cfs_b->runtime;
> >
> > -     expires = cfs_b->runtime_expires;
> >       if (runtime)
> >               cfs_b->distribute_running = 1;
> >
> > @@ -4836,11 +4781,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >       if (!runtime)
> >               return;
> >
> > -     runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> > +     runtime = distribute_cfs_runtime(cfs_b, runtime);
> >
> >       raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > -     if (expires == cfs_b->runtime_expires)
> > -             lsub_positive(&cfs_b->runtime, runtime);
> > +     lsub_positive(&cfs_b->runtime, runtime);
> >       cfs_b->distribute_running = 0;
> >       raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> >  }
> > @@ -4997,8 +4941,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >
> >       cfs_b->period_active = 1;
> >       overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> > -     cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> > -     cfs_b->expires_seq++;
> >       hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> >  }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 802b1f3..28c16e9 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -335,8 +335,6 @@ struct cfs_bandwidth {
> >       u64                     quota;
> >       u64                     runtime;
> >       s64                     hierarchical_quota;
> > -     u64                     runtime_expires;
> > -     int                     expires_seq;
> >
> >       u8                      idle;
> >       u8                      period_active;
> > @@ -556,8 +554,6 @@ struct cfs_rq {
> >
> >  #ifdef CONFIG_CFS_BANDWIDTH
> >       int                     runtime_enabled;
> > -     int                     expires_seq;
> > -     u64                     runtime_expires;
> >       s64                     runtime_remaining;
> >
> >       u64                     throttled_clock;
> > --
> > 1.8.3.1
> >
>
> --


* Re: [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-07-23 22:12       ` Dave Chiluk
@ 2019-07-23 23:26         ` Phil Auld
  0 siblings, 0 replies; 40+ messages in thread
From: Phil Auld @ 2019-07-23 23:26 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Ben Segall, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc

On Tue, Jul 23, 2019 at 05:12:18PM -0500 Dave Chiluk wrote:
> Thanks for all the help and testing you provided.  It's good to know
> these changes have passed at least some scheduler regression tests.
> If it comes to a v7 I'll add the Reviewed-by, otherwise I'll just let
> Peter add it.
>

Sounds good.

> Will you be handling the backport into the RHEL 8 kernels?  I'll
> submit this to Ubuntu and linux-stable once it gets accepted.
>

Yep.

> Thanks again,

Thank you :)


Cheers,
Phil

> 
> 
> On Tue, Jul 23, 2019 at 12:13 PM Phil Auld <pauld@redhat.com> wrote:
> >
> > Hi Dave,
> >
> > On Tue, Jul 23, 2019 at 11:44:26AM -0500 Dave Chiluk wrote:
> > > It has been observed that highly-threaded, non-cpu-bound applications
> > > running under cpu.cfs_quota_us constraints can hit a high percentage of
> > > periods throttled while simultaneously not consuming the allocated
> > > amount of quota. This use case is typical of user-interactive non-cpu
> > > bound applications, such as those running in kubernetes or mesos when
> > > run on multiple cpu cores.
> > >
> > > This has been root caused to cpu-local run queues being allocated per-cpu
> > > bandwidth slices, and then not fully using those slices within the period,
> > > at which point the slice and quota expire. This expiration of unused
> > > slices results in applications not being able to utilize the quota they
> > > have been allocated.
> > >
> > > The non-expiration of per-cpu slices was recently fixed by
> > > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> > > condition")'. Prior to that it appears that this had been broken since
> > > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> > > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> > > added the following conditional which resulted in slices never being
> > > expired.
> > >
> > > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> > >       /* extend local deadline, drift is bounded above by 2 ticks */
> > >       cfs_rq->runtime_expires += TICK_NSEC;
> > >
> > > Because this was broken for nearly 5 years, has recently been fixed, and
> > > is now being noticed by many users running kubernetes
> > > (https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion
> > > that the mechanisms around expiring runtime should be removed
> > > altogether.
> > >
> > > This allows quota already allocated to per-cpu run-queues to live longer
> > > than the period boundary, so threads on runqueues that do not use much
> > > CPU can continue to use their remaining slice over a longer period of
> > > time than cpu.cfs_period_us. In turn, this helps prevent the above
> > > condition of hitting throttling while also not fully utilizing the
> > > allocated cpu quota.
> > >
> > > This theoretically allows a machine to use slightly more than its
> > > allotted quota in some periods. This overflow would be bounded by the
> > > remaining quota left on each per-cpu runqueue. This is typically no
> > > more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> > > change nothing, as they should theoretically fully utilize all of their
> > > quota in each period. For user-interactive tasks as described above this
> > > provides a much better user/application experience as their cpu
> > > utilization will more closely match the amount they requested when they
> > > hit throttling. This means that cpu limits no longer strictly apply per
> > > period for non-cpu bound applications, but that they are still accurate
> > > over longer timeframes.
> > >
> > > This greatly improves performance of high-thread-count, non-cpu bound
> > > applications with low cfs_quota_us allocation on high-core-count
> > > machines. In the case of an artificial testcase (10ms/100ms of quota on
> > > 80 CPU machine), this commit resulted in almost 30x performance
> > > improvement, while still maintaining correct cpu quota restrictions.
> > > That testcase is available at https://github.com/indeedeng/fibtest.
> > >
> > > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> > > Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> > > Reviewed-by: Ben Segall <bsegall@google.com>
> >
> > This still works for me. The documentation reads pretty well, too. Good job.
> >
> > Feel free to add my Acked-by: or Reviewed-by: Phil Auld <pauld@redhat.com>.
> >
> > I'll run it through some more tests when I have time. The code is the same
> > as the earlier one I tested from what I can see.
> >
> > Cheers,
> > Phil
> >
> > > ---
> > >  Documentation/scheduler/sched-bwc.rst | 74 ++++++++++++++++++++++++++++-------
> > >  kernel/sched/fair.c                   | 72 ++++------------------------------
> > >  kernel/sched/sched.h                  |  4 --
> > >  3 files changed, 67 insertions(+), 83 deletions(-)
> > >
> > > diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
> > > index 3a90642..9801d6b 100644
> > > --- a/Documentation/scheduler/sched-bwc.rst
> > > +++ b/Documentation/scheduler/sched-bwc.rst
> > > @@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
> > >  specification of the maximum CPU bandwidth available to a group or hierarchy.
> > >
> > >  The bandwidth allowed for a group is specified using a quota and period. Within
> > > -each given "period" (microseconds), a group is allowed to consume only up to
> > > -"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
> > > -group exceeds this limit (for that period), the tasks belonging to its
> > > -hierarchy will be throttled and are not allowed to run again until the next
> > > -period.
> > > -
> > > -A group's unused runtime is globally tracked, being refreshed with quota units
> > > -above at each period boundary.  As threads consume this bandwidth it is
> > > -transferred to cpu-local "silos" on a demand basis.  The amount transferred
> > > +each given "period" (microseconds), a task group is allocated up to "quota"
> > > +microseconds of CPU time. That quota is assigned to per-cpu run queues in
> > > +slices as threads in the cgroup become runnable. Once all quota has been
> > > +assigned any additional requests for quota will result in those threads being
> > > +throttled. Throttled threads will not be able to run again until the next
> > > +period when the quota is replenished.
> > > +
> > > +A group's unassigned quota is globally tracked, being refreshed back to
> > > +cfs_quota units at each period boundary. As threads consume this bandwidth it
> > > +is transferred to cpu-local "silos" on a demand basis. The amount transferred
> > >  within each of these updates is tunable and described as the "slice".
> > >
> > >  Management
> > > @@ -35,12 +36,12 @@ The default values are::
> > >
> > >  A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
> > >  bandwidth restriction in place, such a group is described as an unconstrained
> > > -bandwidth group.  This represents the traditional work-conserving behavior for
> > > +bandwidth group. This represents the traditional work-conserving behavior for
> > >  CFS.
> > >
> > >  Writing any (valid) positive value(s) will enact the specified bandwidth limit.
> > > -The minimum quota allowed for the quota or period is 1ms.  There is also an
> > > -upper bound on the period length of 1s.  Additional restrictions exist when
> > > +The minimum quota allowed for the quota or period is 1ms. There is also an
> > > +upper bound on the period length of 1s. Additional restrictions exist when
> > >  bandwidth limits are used in a hierarchical fashion, these are explained in
> > >  more detail below.
> > >
> > > @@ -53,8 +54,8 @@ unthrottled if it is in a constrained state.
> > >  System wide settings
> > >  --------------------
> > >  For efficiency run-time is transferred between the global pool and CPU local
> > > -"silos" in a batch fashion.  This greatly reduces global accounting pressure
> > > -on large systems.  The amount transferred each time such an update is required
> > > +"silos" in a batch fashion. This greatly reduces global accounting pressure
> > > +on large systems. The amount transferred each time such an update is required
> > >  is described as the "slice".
> > >
> > >  This is tunable via procfs::
> > > @@ -97,6 +98,51 @@ There are two ways in which a group may become throttled:
> > >  In case b) above, even though the child may have runtime remaining it will not
> > >  be allowed to until the parent's runtime is refreshed.
> > >
> > > +CFS Bandwidth Quota Caveats
> > > +---------------------------
> > > +Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
> > > +the slice may be returned to the global pool if all threads on that cpu become
> > > +unrunnable. This is configured at compile time by the min_cfs_rq_runtime
> > > +variable. This is a performance tweak that helps prevent added contention on
> > > +the global lock.
> > > +
> > > +The fact that cpu-local slices do not expire results in some interesting corner
> > > +cases that should be understood.
> > > +
> > > +For cgroup cpu constrained applications that are cpu limited this is a
> > > +relatively moot point because they will naturally consume the entirety of their
> > > +quota as well as the entirety of each cpu-local slice in each period. As a
> > > +result it is expected that nr_periods roughly equal nr_throttled, and that
> > > +cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
> > > +
> > > +For highly-threaded, non-cpu bound applications this non-expiration nuance
> > > +allows applications to briefly burst past their quota limits by the amount of
> > > +unused slice on each cpu that the task group is running on (typically at most
> > > +1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
> > > +applies if quota had been assigned to a cpu and then not fully used or returned
> > > +in previous periods. This burst amount will not be transferred between cores.
> > > +As a result, this mechanism still strictly limits the task group to quota
> > > +average usage, albeit over a longer time window than a single period.  This
> > > +also limits the burst ability to no more than 1ms per cpu.  This provides
> > > +a better, more predictable user experience for highly threaded applications
> > > +with small quota limits on high core count machines. It also eliminates the
> > > +propensity to throttle these applications while simultaneously using less than
> > > +quota amounts of cpu. Another way to say this is that by allowing the unused
> > > +portion of a slice to remain valid across periods we have decreased the
> > > +possibility of wastefully expiring quota on cpu-local silos that don't need a
> > > +full slice's amount of cpu time.
> > > +
> > > +The interaction between cpu-bound and non-cpu-bound-interactive applications
> > > +should also be considered, especially when single core usage hits 100%. If you
> > > +gave each of these applications half of a cpu-core and they both got scheduled
> > > +on the same CPU it is theoretically possible that the non-cpu bound application
> > > +will use up to 1ms additional quota in some periods, thereby preventing the
> > > +cpu-bound application from fully using its quota by that same amount. In these
> > > +instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
> > > +decide which application is chosen to run, as they will both be runnable and
> > > +have remaining quota. This runtime discrepancy will be made up in the following
> > > +periods when the interactive application idles.
> > > +
> > >  Examples
> > >  --------
> > >  1. Limit a group to 1 CPU worth of runtime::
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 036be95..00b68f0 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4316,8 +4316,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> > >
> > >       now = sched_clock_cpu(smp_processor_id());
> > >       cfs_b->runtime = cfs_b->quota;
> > > -     cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> > > -     cfs_b->expires_seq++;
> > >  }
> > >
> > >  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> > > @@ -4339,8 +4337,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > >  {
> > >       struct task_group *tg = cfs_rq->tg;
> > >       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> > > -     u64 amount = 0, min_amount, expires;
> > > -     int expires_seq;
> > > +     u64 amount = 0, min_amount;
> > >
> > >       /* note: this is a positive sum as runtime_remaining <= 0 */
> > >       min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> > > @@ -4357,61 +4354,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > >                       cfs_b->idle = 0;
> > >               }
> > >       }
> > > -     expires_seq = cfs_b->expires_seq;
> > > -     expires = cfs_b->runtime_expires;
> > >       raw_spin_unlock(&cfs_b->lock);
> > >
> > >       cfs_rq->runtime_remaining += amount;
> > > -     /*
> > > -      * we may have advanced our local expiration to account for allowed
> > > -      * spread between our sched_clock and the one on which runtime was
> > > -      * issued.
> > > -      */
> > > -     if (cfs_rq->expires_seq != expires_seq) {
> > > -             cfs_rq->expires_seq = expires_seq;
> > > -             cfs_rq->runtime_expires = expires;
> > > -     }
> > >
> > >       return cfs_rq->runtime_remaining > 0;
> > >  }
> > >
> > > -/*
> > > - * Note: This depends on the synchronization provided by sched_clock and the
> > > - * fact that rq->clock snapshots this value.
> > > - */
> > > -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > > -{
> > > -     struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> > > -
> > > -     /* if the deadline is ahead of our clock, nothing to do */
> > > -     if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> > > -             return;
> > > -
> > > -     if (cfs_rq->runtime_remaining < 0)
> > > -             return;
> > > -
> > > -     /*
> > > -      * If the local deadline has passed we have to consider the
> > > -      * possibility that our sched_clock is 'fast' and the global deadline
> > > -      * has not truly expired.
> > > -      *
> > > -      * Fortunately we can check determine whether this the case by checking
> > > -      * whether the global deadline(cfs_b->expires_seq) has advanced.
> > > -      */
> > > -     if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> > > -             /* extend local deadline, drift is bounded above by 2 ticks */
> > > -             cfs_rq->runtime_expires += TICK_NSEC;
> > > -     } else {
> > > -             /* global deadline is ahead, expiration has passed */
> > > -             cfs_rq->runtime_remaining = 0;
> > > -     }
> > > -}
> > > -
> > >  static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> > >  {
> > >       /* dock delta_exec before expiring quota (as it could span periods) */
> > >       cfs_rq->runtime_remaining -= delta_exec;
> > > -     expire_cfs_rq_runtime(cfs_rq);
> > >
> > >       if (likely(cfs_rq->runtime_remaining > 0))
> > >               return;
> > > @@ -4602,8 +4555,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > >               resched_curr(rq);
> > >  }
> > >
> > > -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> > > -             u64 remaining, u64 expires)
> > > +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
> > >  {
> > >       struct cfs_rq *cfs_rq;
> > >       u64 runtime;
> > > @@ -4625,7 +4577,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> > >               remaining -= runtime;
> > >
> > >               cfs_rq->runtime_remaining += runtime;
> > > -             cfs_rq->runtime_expires = expires;
> > >
> > >               /* we check whether we're throttled above */
> > >               if (cfs_rq->runtime_remaining > 0)
> > > @@ -4650,7 +4601,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> > >   */
> > >  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
> > >  {
> > > -     u64 runtime, runtime_expires;
> > > +     u64 runtime;
> > >       int throttled;
> > >
> > >       /* no need to continue the timer with no bandwidth constraint */
> > > @@ -4678,8 +4629,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> > >       /* account preceding periods in which throttling occurred */
> > >       cfs_b->nr_throttled += overrun;
> > >
> > > -     runtime_expires = cfs_b->runtime_expires;
> > > -
> > >       /*
> > >        * This check is repeated as we are holding onto the new bandwidth while
> > >        * we unthrottle. This can potentially race with an unthrottled group
> > > @@ -4692,8 +4641,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> > >               cfs_b->distribute_running = 1;
> > >               raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> > >               /* we can't nest cfs_b->lock while distributing bandwidth */
> > > -             runtime = distribute_cfs_runtime(cfs_b, runtime,
> > > -                                              runtime_expires);
> > > +             runtime = distribute_cfs_runtime(cfs_b, runtime);
> > >               raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > >
> > >               cfs_b->distribute_running = 0;
> > > @@ -4775,8 +4723,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > >               return;
> > >
> > >       raw_spin_lock(&cfs_b->lock);
> > > -     if (cfs_b->quota != RUNTIME_INF &&
> > > -         cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> > > +     if (cfs_b->quota != RUNTIME_INF) {
> > >               cfs_b->runtime += slack_runtime;
> > >
> > >               /* we are under rq->lock, defer unthrottling using a timer */
> > > @@ -4809,7 +4756,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> > >  {
> > >       u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> > >       unsigned long flags;
> > > -     u64 expires;
> > >
> > >       /* confirm we're still not at a refresh boundary */
> > >       raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > > @@ -4827,7 +4773,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> > >       if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
> > >               runtime = cfs_b->runtime;
> > >
> > > -     expires = cfs_b->runtime_expires;
> > >       if (runtime)
> > >               cfs_b->distribute_running = 1;
> > >
> > > @@ -4836,11 +4781,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> > >       if (!runtime)
> > >               return;
> > >
> > > -     runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> > > +     runtime = distribute_cfs_runtime(cfs_b, runtime);
> > >
> > >       raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > > -     if (expires == cfs_b->runtime_expires)
> > > -             lsub_positive(&cfs_b->runtime, runtime);
> > > +     lsub_positive(&cfs_b->runtime, runtime);
> > >       cfs_b->distribute_running = 0;
> > >       raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> > >  }
> > > @@ -4997,8 +4941,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > >
> > >       cfs_b->period_active = 1;
> > >       overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> > > -     cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> > > -     cfs_b->expires_seq++;
> > >       hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> > >  }
> > >
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 802b1f3..28c16e9 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -335,8 +335,6 @@ struct cfs_bandwidth {
> > >       u64                     quota;
> > >       u64                     runtime;
> > >       s64                     hierarchical_quota;
> > > -     u64                     runtime_expires;
> > > -     int                     expires_seq;
> > >
> > >       u8                      idle;
> > >       u8                      period_active;
> > > @@ -556,8 +554,6 @@ struct cfs_rq {
> > >
> > >  #ifdef CONFIG_CFS_BANDWIDTH
> > >       int                     runtime_enabled;
> > > -     int                     expires_seq;
> > > -     u64                     runtime_expires;
> > >       s64                     runtime_remaining;
> > >
> > >       u64                     throttled_clock;
> > > --
> > > 1.8.3.1
> > >
> >
> > --

-- 


* Re: [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-07-23 17:13     ` Phil Auld
  2019-07-23 22:12       ` Dave Chiluk
@ 2019-07-26 18:14       ` Peter Zijlstra
  1 sibling, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2019-07-26 18:14 UTC (permalink / raw)
  To: Phil Auld
  Cc: Dave Chiluk, Ben Segall, Peter Oskolkov, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc

On Tue, Jul 23, 2019 at 01:13:09PM -0400, Phil Auld wrote:
> Hi Dave,
> 
> On Tue, Jul 23, 2019 at 11:44:26AM -0500 Dave Chiluk wrote:
> > It has been observed that highly-threaded, non-cpu-bound applications
> > running under cpu.cfs_quota_us constraints can hit a high percentage of
> > periods throttled while simultaneously not consuming the allocated
> > amount of quota. This use case is typical of user-interactive non-cpu
> > bound applications, such as those running in kubernetes or mesos when
> > run on multiple cpu cores.
> > 
> > This has been root caused to cpu-local run queues being allocated per-cpu
> > bandwidth slices, and then not fully using those slices within the period,
> > at which point the slice and quota expire. This expiration of unused
> > slices results in applications not being able to utilize the quota they
> > have been allocated.
> > 
> > The non-expiration of per-cpu slices was recently fixed by
> > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> > condition")'. Prior to that it appears that this had been broken since
> > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> > added the following conditional which resulted in slices never being
> > expired.
> > 
> > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> > 	/* extend local deadline, drift is bounded above by 2 ticks */
> > 	cfs_rq->runtime_expires += TICK_NSEC;
> > 
> > Because this was broken for nearly 5 years, has recently been fixed, and
> > is now being noticed by many users running kubernetes
> > (https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion
> > that the mechanisms around expiring runtime should be removed
> > altogether.
> > 
> > This allows quota already allocated to per-cpu run-queues to live longer
> > than the period boundary, so threads on runqueues that do not use much
> > CPU can continue to use their remaining slice over a longer period of
> > time than cpu.cfs_period_us. In turn, this helps prevent the above
> > condition of hitting throttling while also not fully utilizing the
> > allocated cpu quota.
> > 
> > This theoretically allows a machine to use slightly more than its
> > allotted quota in some periods. This overflow would be bounded by the
> > remaining quota left on each per-cpu runqueue. This is typically no
> > more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> > change nothing, as they should theoretically fully utilize all of their
> > quota in each period. For user-interactive tasks as described above this
> > provides a much better user/application experience as their cpu
> > utilization will more closely match the amount they requested when they
> > hit throttling. This means that cpu limits no longer strictly apply per
> > period for non-cpu bound applications, but that they are still accurate
> > over longer timeframes.
> > 
> > This greatly improves performance of high-thread-count, non-cpu bound
> > applications with low cfs_quota_us allocation on high-core-count
> > machines. In the case of an artificial testcase (10ms/100ms of quota on
> > 80 CPU machine), this commit resulted in almost 30x performance
> > improvement, while still maintaining correct cpu quota restrictions.
> > That testcase is available at https://github.com/indeedeng/fibtest.
> > 
> > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> > Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> > Reviewed-by: Ben Segall <bsegall@google.com>
> 
> This still works for me. The documentation reads pretty well, too. Good job.
> 
> Feel free to add my Acked-by: or Reviewed-by: Phil Auld <pauld@redhat.com>.

Thanks guys!


* [tip:sched/core] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-07-23 16:44   ` [PATCH v6 1/1] " Dave Chiluk
  2019-07-23 17:13     ` Phil Auld
@ 2019-08-08 10:53     ` tip-bot for Dave Chiluk
  1 sibling, 0 replies; 40+ messages in thread
From: tip-bot for Dave Chiluk @ 2019-08-08 10:53 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: posk, hpa, jhammond, mingo, gmunoz, corbet, tglx, chiluk+linux,
	mingo, kwa, linux-kernel, pauld, peterz, xiyou.wangcong, bsegall,
	bgregg

Commit-ID:  de53fd7aedb100f03e5d2231cfce0e4993282425
Gitweb:     https://git.kernel.org/tip/de53fd7aedb100f03e5d2231cfce0e4993282425
Author:     Dave Chiluk <chiluk+linux@indeed.com>
AuthorDate: Tue, 23 Jul 2019 11:44:26 -0500
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Thu, 8 Aug 2019 09:09:30 +0200

sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to cpu-local run queues being allocated per-cpu
bandwidth slices, and then not fully using those slices within the period,
at which point the slice and quota expire. This expiration of unused
slices results in applications not being able to utilize the quota they
have been allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, has recently been fixed, and
is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary, so threads on runqueues that do not use much
CPU can continue to use their remaining slice over a longer period of
time than cpu.cfs_period_us. In turn, this helps prevent the above
condition of hitting throttling while also not fully utilizing the
allocated cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.
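
As a rough, illustrative bound (assuming min_cfs_rq_runtime stays at its
default of 1ms and using the artificial testcase figures below): on an
80-CPU machine the worst-case single-period overage is

	80 cpus * 1ms of stranded slice per cpu = 80ms of extra runtime,

and that overage can only exist because at least the same 80ms of quota
went unused in earlier periods, so average consumption over a longer
window remains capped at quota/period = 10ms/100ms = 10% of one cpu.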

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.
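
For reference, a minimal sketch of the corresponding cgroup v1 setup is
shown below (the mount point /sys/fs/cgroup/cpu and the group name
"fibtest" are illustrative assumptions, not part of the testcase itself):

	mkdir /sys/fs/cgroup/cpu/fibtest
	echo 100000 > /sys/fs/cgroup/cpu/fibtest/cpu.cfs_period_us  # 100ms period
	echo  10000 > /sys/fs/cgroup/cpu/fibtest/cpu.cfs_quota_us   # 10ms quota
	# run the workload inside the group, then inspect throttling behaviour:
	cat /sys/fs/cgroup/cpu/fibtest/cpu.stat  # nr_periods, nr_throttled, throttled_time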

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Hammond <jhammond@indeed.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kyle Anderson <kwa@yelp.com>
Cc: Gabriel Munos <gmunoz@netflix.com>
Cc: Peter Oskolkov <posk@posk.io>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
---
 Documentation/scheduler/sched-bwc.rst | 74 ++++++++++++++++++++++++++++-------
 kernel/sched/fair.c                   | 72 ++++------------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 67 insertions(+), 83 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
index 3a9064219656..9801d6b284b1 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.
 
 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary.  As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis.  The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per-cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
 Management
@@ -35,12 +36,12 @@ The default values are::
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group.  This represents the traditional work-conserving behavior for
+bandwidth group. This represents the traditional work-conserving behavior for
 CFS.
 
 Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms.  There is also an
-upper bound on the period length of 1s.  Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
 bandwidth limits are used in a hierarchical fashion, these are explained in
 more detail below.
 
@@ -53,8 +54,8 @@ unthrottled if it is in a constrained state.
 System wide settings
 --------------------
 For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion.  This greatly reduces global accounting pressure
-on large systems.  The amount transferred each time such an update is required
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
 is described as the "slice".
 
 This is tunable via procfs::
@@ -97,6 +98,51 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
 
+CFS Bandwidth Quota Caveats
+---------------------------
+Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
+the slice may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable. This is a performance tweak that helps prevent added contention on
+the global lock.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+For highly-threaded, non-cpu bound applications this non-expiration nuance
+allows applications to briefly burst past their quota limits by the amount of
+unused slice on each cpu that the task group is running on (typically at most
+1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
+applies if quota had been assigned to a cpu and then not fully used or returned
+in previous periods. This burst amount will not be transferred between cores.
+As a result, this mechanism still strictly limits the task group to quota
+average usage, albeit over a longer time window than a single period.  This
+also limits the burst ability to no more than 1ms per cpu.  This provides
+a better, more predictable user experience for highly threaded applications
+with small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultaneously using less than
+quota amounts of cpu. Another way to say this is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wastefully expiring quota on cpu-local silos that don't need a
+full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
 Examples
 --------
 1. Limit a group to 1 CPU worth of runtime::
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fb75c0bea80f..7d8043fc8317 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4371,8 +4371,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4394,8 +4392,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4412,61 +4409,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4661,8 +4614,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4684,7 +4636,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4709,7 +4660,7 @@ next:
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4737,8 +4688,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4751,8 +4700,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
@@ -4834,8 +4782,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4868,7 +4815,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 	unsigned long flags;
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4886,7 +4832,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4895,11 +4840,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	if (expires == cfs_b->runtime_expires)
-		lsub_positive(&cfs_b->runtime, runtime);
+	lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
@@ -5056,8 +5000,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7583faddba33..ea48aa5daeee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -335,8 +335,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;
 
 	u8			idle;
 	u8			period_active;
@@ -557,8 +555,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;
 
 	u64			throttled_clock;


Thread overview: 40+ messages
2019-05-17 19:30 [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices Dave Chiluk
2019-05-23 18:44 ` [PATCH v2 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk
2019-05-23 18:44   ` [PATCH v2 1/1] " Dave Chiluk
2019-05-23 21:01     ` Peter Oskolkov
2019-05-24 14:32       ` Phil Auld
2019-05-24 15:14         ` Dave Chiluk
2019-05-24 15:59           ` Phil Auld
2019-05-24 16:28           ` Peter Oskolkov
2019-05-24 21:35             ` Dave Chiluk
2019-05-24 22:07               ` Peter Oskolkov
2019-05-28 22:25                 ` Dave Chiluk
2019-05-24  8:55     ` Peter Zijlstra
2019-05-29 19:08 ` [PATCH v3 0/1] " Dave Chiluk
2019-05-29 19:08   ` [PATCH v3 1/1] " Dave Chiluk
2019-05-29 19:28     ` Phil Auld
2019-05-29 19:50     ` bsegall
2019-05-29 21:05     ` bsegall
2019-05-30 17:53       ` Dave Chiluk
2019-05-30 20:44         ` bsegall
     [not found] ` <1561391404-14450-1-git-send-email-chiluk+linux@indeed.com>
2019-06-24 15:50   ` [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining Dave Chiluk
2019-06-24 17:33     ` bsegall
2019-06-26 22:10       ` Dave Chiluk
2019-06-27 20:18         ` bsegall
2019-06-27 19:09 ` [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk
2019-06-27 19:49 ` [PATCH v5 0/1] " Dave Chiluk
2019-06-27 19:49   ` [PATCH v5 1/1] " Dave Chiluk
2019-07-01 20:15     ` bsegall
2019-07-11  9:51       ` Peter Zijlstra
2019-07-11 17:46         ` bsegall
     [not found]           ` <CAC=E7cV4sO50NpYOZ06n_BkZTcBqf1KQp83prc+oave3ircBrw@mail.gmail.com>
2019-07-12 18:01             ` bsegall
2019-07-12 22:09             ` bsegall
2019-07-15 15:44               ` Dave Chiluk
2019-07-16 19:58     ` bsegall
2019-07-23 16:44 ` [PATCH v6 0/1] " Dave Chiluk
2019-07-23 16:44   ` [PATCH v6 1/1] " Dave Chiluk
2019-07-23 17:13     ` Phil Auld
2019-07-23 22:12       ` Dave Chiluk
2019-07-23 23:26         ` Phil Auld
2019-07-26 18:14       ` Peter Zijlstra
2019-08-08 10:53     ` [tip:sched/core] " tip-bot for Dave Chiluk
