* Cgroup cpu throttling with low cpu usage of multi-threaded applications on high-core count machines
From: Dave Chiluk @ 2019-03-18 14:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel,
	Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond,
	Cong Wang

We are seeing a high amount of cgroup cpu throttling, as measured by
nr_throttled/nr_periods, while also seeing low cpu usage when running
highly threaded applications on high core-count machines.  In
particular we are seeing this with "thread pool" design pattern
applications running on kubernetes with cpu hard limits.  We've seen
similar issues on other microservice cloud architectures that use
cgroup cpu constraints.  Most advice about this problem is either to
over-commit cpu, which is wasteful, or to turn off hard limits and
rely on the cpu_shares mechanism instead.
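
For reference, the throttling ratio above comes straight from the
cgroup's cpu.stat file.  A minimal sketch of how one might compute it,
in C, assuming the cgroup v1 layout and an illustrative cgroup named
"mygroup" (substitute your own path):

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* path is illustrative; adjust to your cgroup hierarchy */
	FILE *f = fopen("/sys/fs/cgroup/cpu/mygroup/cpu.stat", "r");
	char key[32];
	long long val, nr_periods = 0, nr_throttled = 0;

	if (!f)
		return 1;
	/* cpu.stat is "key value" pairs, one per line */
	while (fscanf(f, "%31s %lld", key, &val) == 2) {
		if (!strcmp(key, "nr_periods"))
			nr_periods = val;
		else if (!strcmp(key, "nr_throttled"))
			nr_throttled = val;
	}
	fclose(f);
	if (nr_periods)
		printf("throttled %.1f%% of periods\n",
		       100.0 * nr_throttled / nr_periods);
	return 0;
}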

We've root caused this to bandwidth slices being allocated to
runqueues that own threads that do little work.  This results in the
primary "fast" worker threads being starved of runtime and throttled,
while the runtime allocated to the cfs_rq's of the less productive
threads goes unused.  Eventually the slices on the less productive
threads expire, wasting cpu quota.

This issue is exacerbated even further as you move from 8-core to
80-core machines, as slices are allocated to, and left unused on, that
many more cfs_rq's.  On an 80-core machine, assigning the default time
slice (5ms) to each cfs_rq requires 400ms of quota per 100ms period
just to give each cfs_rq one slice; that is 4 CPUs' worth of quota.
In reality tasks rarely get spread out to every core like this, so
that is the worst-case scenario, but it is also why we saw a
performance regression when moving from older 46-core machines to
newer 80-core machines.  Now that the world is moving to micro-service
architectures such as kubernetes, more and more applications are being
run with cgroup cpu constraints like this.
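
To make the worst case concrete, the arithmetic as a trivial sketch
(using the default slice and period named above; the constants are the
only inputs):

#include <stdio.h>

int main(void)
{
	const long long slice_us  = 5000;   /* default sched_cfs_bandwidth_slice_us */
	const long long period_us = 100000; /* default cpu.cfs_period_us */
	const int ncpus = 80;

	/* quota needed for every cfs_rq to hold one unexpired slice */
	long long worst_us = (long long)ncpus * slice_us;

	printf("%lld us of quota per %lld us period = %.1f cpus\n",
	       worst_us, period_us, (double)worst_us / period_us);
	return 0;
}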

I have created an artificial C testcase that reproduces the problem
and have posted it at
https://github.com/indeedeng/fibtest
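
The reproducer follows roughly this shape (a minimal sketch of the
pattern, not the actual fibtest source; thread counts and timings are
illustrative).  Run it inside a cgroup with a small cpu quota and
compare the iteration count with and without the slow workers:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile unsigned long iterations;

static void *fast_worker(void *arg)
{
	unsigned long a = 0, b = 1, t;

	for (;;) {			/* cpu-bound fibonacci spin */
		t = a + b; a = b; b = t;
		iterations++;
	}
	return NULL;
}

static void *slow_worker(void *arg)
{
	for (;;) {			/* mostly idle thread-pool worker */
		usleep(10000);		/* sleep 10ms */
		for (volatile int i = 0; i < 10000; i++)
			;		/* tiny burst of work */
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	int i;

	pthread_create(&t, NULL, fast_worker, NULL);
	for (i = 0; i < 7; i++)		/* the "thread pool" */
		pthread_create(&t, NULL, slow_worker, NULL);
	sleep(10);
	printf("iterations: %lu\n", iterations);
	return 0;
}

The slow workers each pull a bandwidth slice onto their cpu's cfs_rq,
barely touch it, and then let it expire, starving the fast worker.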

I have used that testcase to identify 512ac99 as the source of this
performance regression.  However, as far as I can tell, 512ac99 is
technically a correct patch.  What was happening before it is that the
runtime on each cfs_rq would almost never be expired, because the
following conditional was almost always true.

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
...
if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
       /* extend local deadline, drift is bounded above by 2 ticks */
       cfs_rq->runtime_expires += TICK_NSEC;
..
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I verified this by adding a variable to the cfs_bandwidth structure
that counted all of the runtime_remaining that would have been expired
in the else clause of that if, and found that the else clause was
never hit pre-512ac99 on my test machines.  After 512ac99, however,
lots of runtime would be expired.  I understand that this differs from
the experience of the submitters of 512ac99, and I suspect there may
be some architecture or configuration difference at play.  Looking
back at commit 51f2176d, which introduced the above logic, this
behavior appears to have existed since 3.16.
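
For the curious, the instrumentation looked roughly like this (a
sketch of the idea, not my actual patch; expired_ns is a hypothetical
debug field added to struct cfs_bandwidth):

static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
	...
	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
		/* extend local deadline, drift is bounded above by 2 ticks */
		cfs_rq->runtime_expires += TICK_NSEC;
	} else {
		/* global deadline is ahead, expiration has passed */
		cfs_b->expired_ns += cfs_rq->runtime_remaining; /* count what we throw away */
		cfs_rq->runtime_remaining = 0;
	}
}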

Cong Wang submitted a patch that happens to work around this by
implementing bursting based on idle time:
https://lore.kernel.org/patchwork/patch/907450/.  However beneficial,
that patch is orthogonal to the root cause of this problem, but I
wanted to mention it.

So my question is: what should be done?
1. Make the expiration time of a time slice configurable, with the
default set to INF to match the behavior of the kernel as it existed
v3.16..v4.18-rc4.
2. Remove all of the cfs_bandwidth time slice expiration logic, as
time slices naturally expire as they get used anyway.  This is my
preferred course of action, as it is the most performant and removes
what appears to be some very hardware-sensitive logic.  Additionally,
the hard limit description still holds true, albeit not strictly per
accounting period.  Since no one has complained about that over the 5
years it was broken, I think it's safe to assume that very few people
are watching it that carefully.  It is a much worse user experience to
ask for .5 cpu and only be able to use .1 of it while being throttled
because of time slice expiration.
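
To make that concrete (illustrative numbers, not measurements): with
cpu.cfs_quota_us=50000 and cpu.cfs_period_us=100000, if 20ms of
allocated slices go unused on cfs_rq's in one period, the group could
consume up to 70ms in the next before being throttled, after which it
is again capped at 50ms per period.  Averaged over the two periods,
the group still used no more than its 50% allocation.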

Thank you,
Dave Chiluk
p.s. I've copied representatives of Netflix and Yelp as well, as we
were talking about this at the Scale 17x conference. There we
discovered we had all individually hit this issue and had it on
roadmaps to fix.


* [PATCH 0/1] cgroup: Fix low cpu usage with high throttling by removing slice expiration
From: Dave Chiluk @ 2019-04-10 22:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel,
	Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond,
	Cong Wang


The following patch is an implementation of option 2 from my earlier
e-mail: the removal of runtime_expires and all related slice
expiration logic.

This is a very early iteration of this patch, and testing/comments
are greatly appreciated.

Thanks,
Dave.


* [PATCH] cgroup: Fix low cpu usage with high throttling by removing slice expiration
From: Dave Chiluk @ 2019-04-10 22:20 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel,
	Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond,
	Cong Wang

It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota.  This use case is typical of user-interactive,
non-cpu-bound web services, such as those running in kubernetes or
mesos.

This has been root caused to threads being allocated per-cpu bandwidth
slices, not fully using those slices within the period, and then having
the quota expire.  This constant expiration of unused quota results in
applications not being able to utilize the quota they are allocated.

The expiration of quota was recently fixed by commit 512ac999d275
("sched/fair: Fix bandwidth timer clock drift condition").  Prior to
that, this appears to have been broken since at least commit
51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period"), which was introduced in v3.16-rc1 in 2014.  That
commit added the following test, which resulted in runtime never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and the recent fix is now
being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577), it is my
opinion that the mechanisms around expiring runtime should be removed
altogether.

This allows quota runtime slices allocated to per-cpu runqueues to live
longer than the period boundary.  Threads on runqueues that do not use
much CPU can therefore continue to use their remaining slice over a
longer window than cpu.cfs_period_us.  In doing so, this prevents the
above condition of hitting throttling while not fully utilizing the cpu
quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods.  The overflow is equal to the amount of
quota that was left unused on cfs_rq's in the previous period.  For
CPU-bound tasks this changes nothing, as they should theoretically
fully utilize all of their quota in each period.  For user-interactive
tasks as described above, this provides a much better user/application
experience, as their cpu utilization will more closely match the amount
they requested when they do hit throttling.

This greatly improves the performance of high-thread-count, interactive
applications with low cfs_quota_us allocations on high-core-count
machines.  On an artificial testcase, an almost 30x performance
improvement has been observed, while still maintaining correct cpu
quota restrictions, albeit over longer time intervals than
cpu.cfs_period_us.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
---
 kernel/sched/fair.c  | 71 +++++-----------------------------------------------
 kernel/sched/sched.h |  4 ---
 2 files changed, 6 insertions(+), 69 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdab7eb..b0c3d76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4291,8 +4291,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4314,8 +4312,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4332,61 +4329,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4577,8 +4530,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4600,7 +4552,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4625,7 +4576,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4653,8 +4604,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4667,8 +4616,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
@@ -4745,8 +4693,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4779,7 +4726,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 	unsigned long flags;
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4796,7 +4742,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4805,11 +4750,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	if (expires == cfs_b->runtime_expires)
-		lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
@@ -4940,8 +4883,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efa686e..69d9bf9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;
 
 	short			idle;
 	short			period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;
 
 	u64			throttled_clock;
-- 
1.8.3.1



* Re: [PATCH] cgroup: Fix low cpu usage with high throttling by removing slice expiration
From: Dave Chiluk @ 2019-05-08  4:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel,
	Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond,
	Cong Wang

I'd really appreciate some attention on this.  Should I have marked
the subject as sched: instead?

I heard through some back-channels that there may be concern about the
ability to use more cpu than allocated in a given period.  To that I
say:
#1 The current behavior, where an application hits cpu throttling while
simultaneously accounting for much less cpu time than was allocated, is
a very poor user experience, i.e. granted .5 cpu, but only able to use
.1 cpu while simultaneously hitting throttling.
#2 This has been broken like this since at least 3.16-rc1, which is why
I ripped out most of the logic instead of trying to patch it again.  I
proved this experimentally by adding a counter in expire_cfs_rq_runtime
that fires when runtime is expired; I can share the patches if that
would help.  That means user-interactive applications have been able to
over-use quota in a similar manner since June 2014, and no one has
noticed or complained.  Now that the mechanism is "fixed", people are
starting to notice, and they are complaining loudly.  See
https://github.com/kubernetes/kubernetes/issues/67577 and the many
tickets linked to it.
#3 Even though it is true that you can use more cpu than allocated in a
period, that requires under-using quota in previous periods by an
amount equal to the overage.  In effect you are still enforcing the
quota requirements, albeit over longer time-frames than cfs_period_us
(I'm amenable to a documentation update to clarify this nuance).
Additionally, any single cpu runqueue can only over-use by as much as
sched_cfs_bandwidth_slice_us, which defaults to 5ms, so other
applications on the same processor will at most be hindered by that
amount.  A worked example follows this list.
#4 Cpu-bound applications will not be able to over-use in any period,
as the entirety of their quota is consumed every period.
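
To put numbers on #3 (illustrative, using the defaults): on an 8-cpu
machine each per-cpu runqueue can hold at most one stale 5ms slice, so
a group can exceed its quota by at most 8 * 5ms = 40ms in a single
100ms period, and only if it banked at least that much under-use in
earlier periods.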

Your review would be much appreciated.
Thank you,
Dave



