Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
@ 2019-09-25  5:53 Dave Chiluk
  2019-09-25  6:44 ` Greg KH
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chiluk @ 2019-09-25  5:53 UTC (permalink / raw)
  To: stable

Commit de53fd7aedb1 ("sched/fair: Fix low cpu usage with high
throttling by removing expiration of cpu-local slices") fixes a major
performance issue for containerized clouds such as Kubernetes.

Commit de53fd7aedb1 Fixes commit : 512ac999d275 ("sched/fair: Fix
bandwidth timer clock drift condition").

This should be applied to all stable kernels that applied commit
512ac999d275, and should probably be applied to all others as well.

The issues introduced by these pathes can be read about on the
Kubernetes github.
https://github.com/kubernetes/kubernetes/issues/67577

It may also be prudent to also apply the not yet accepted patch that
fixes some introduced compiler warnings discussed here.
https://lkml.org/lkml/2019/9/18/925

Thank you,
Dave Chiluk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-09-25  5:53 Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk
@ 2019-09-25  6:44 ` Greg KH
  2019-09-27  6:12   ` Dave Chiluk
  0 siblings, 1 reply; 20+ messages in thread
From: Greg KH @ 2019-09-25  6:44 UTC (permalink / raw)
  To: Dave Chiluk; +Cc: stable

On Wed, Sep 25, 2019 at 12:53:48AM -0500, Dave Chiluk wrote:
> Commit de53fd7aedb1 ("sched/fair: Fix low cpu usage with high
> throttling by removing expiration of cpu-local slices") fixes a major
> performance issue for containerized clouds such as Kubernetes.
> 
> Commit de53fd7aedb1 Fixes commit : 512ac999d275 ("sched/fair: Fix
> bandwidth timer clock drift condition").
> 
> This should be applied to all stable kernels that applied commit
> 512ac999d275, and should probably be applied to all others as well.

As this commit isn't in a released kernel just yet, we should wait to
see what happens when it hits people's machines, right?

Also, always cc: all of the people involved in the patch you are asking
for, so as to get their opinion.  For some reason this patch did not
have a cc: stable tag on it, was that because the developers did not
think it was relevant for stable kernels?

> The issues introduced by these pathes can be read about on the
> Kubernetes github.
> https://github.com/kubernetes/kubernetes/issues/67577
> 
> It may also be prudent to also apply the not yet accepted patch that
> fixes some introduced compiler warnings discussed here.
> https://lkml.org/lkml/2019/9/18/925

So we should wait for this to hit Linus's tree too, right?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-09-25  6:44 ` Greg KH
@ 2019-09-27  6:12   ` Dave Chiluk
  2019-09-27  6:24     ` Greg KH
  2019-09-27 13:13     ` Sasha Levin
  0 siblings, 2 replies; 20+ messages in thread
From: Dave Chiluk @ 2019-09-27  6:12 UTC (permalink / raw)
  To: Greg KH, Ben Segall, Peter Zijlstra, Phil Auld; +Cc: stable

On Wed, Sep 25, 2019 at 1:44 AM Greg KH <greg@kroah.com> wrote:
>
> On Wed, Sep 25, 2019 at 12:53:48AM -0500, Dave Chiluk wrote:
> > Commit de53fd7aedb1 ("sched/fair: Fix low cpu usage with high
> > throttling by removing expiration of cpu-local slices") fixes a major
> > performance issue for containerized clouds such as Kubernetes.
> >
> > Commit de53fd7aedb1 Fixes commit : 512ac999d275 ("sched/fair: Fix
> > bandwidth timer clock drift condition").
> >
> > This should be applied to all stable kernels that applied commit
> > 512ac999d275, and should probably be applied to all others as well.
>
> As this commit isn't in a released kernel just yet, we should wait to
> see what happens when it hits people's machines, right?

I think waiting till 5.4 is released would be irresponsible on this
one.  512ac999 was recently pushed back into many of the distro
kernels via the stable streams.  Additionally every major container
running cloud provider (public and private), is likely hitting this
whether they've noticed or not.  Before 512ac999 we would hit the
issue that resolved once every 10000 application deployments or so *(I
admit some providers appear to have been hit by that more often).
However, the fix implemented by 512ac999 is guaranteed to affect 100%
of cgroup cpu bandwidth constrained applications.

Our java applications are particularly hard hit as java tends to be
very thread happy.   Doing some napkin math, given the roughly 9000
applications we have running in our clouds we've had to over-allocate
each one's CPU quota by roughly .5 cpu to account for this behavior
change (worst case scenario is actually .01 cpu * cores in machine,
but not all of our applications are affected equally).
9000 applications * .5 CPU = 4500 cores
4500 Cores / 88 cores per node = 51 additional machines required to
satisfy the inflated quota requirements.
Given each 88 core machine costs roughly $30k that equates to $1.5M
that this issue is costing us.
We have been able to alleviate this somewhat of this by turning on CPU
overcommit on in the container scheduler. (Deploying more applications
per host than the CPU would generally allow for).  This however leads
to occasional overloaded machines, and regular high response
times/wide response time distributions for applications.

If you have a java or golang application running on a containerized
cloud anywhere chances are you are either paying extra to get the CPU
you need or your application is being throttled before using the quota
you've paid for.  I'm sure other languages could be affected, but
these are the two we've seen that by default generate enough threads
to regularly hit this issue.

All of these problems are leading the kubernetes community to change
the defaulrs away from the cfs bandwidth scheduling mechanisms and
towards other mechanisms or hackish workarounds. *(adjusting periods,
turning off cfs bandwidth, moving to only soft limits).
https://github.com/kubernetes/kubernetes/issues/67577
https://github.com/kubernetes/kubernetes/issues/70585
https://github.com/kubernetes/kubernetes/issues/51135

I appreciate your reluctance to accept this patch, but I strongly
believe pulling this sooner than later is the right thing to do.

> Also, always cc: all of the people involved in the patch you are asking
> for, so as to get their opinion.
Fixed.  I'll submit a change to stable-kernel-rules.rst upon my return
from vacation.

> For some reason this patch did not
> have a cc: stable tag on it, was that because the developers did not
> think it was relevant for stable kernels?

This was my fault, I was unaware I could simply cc: stable to have it
auto-submitted.  I promise I'll be better in the future.

> > It may also be prudent to also apply the not yet accepted patch that
> > fixes some introduced compiler warnings discussed here.
> > https://lkml.org/lkml/2019/9/18/925
>
> So we should wait for this to hit Linus's tree too, right?

This patch contributes nothing materially to de53fd7aedb1, and only
eliminates compiler warnings for unused variables *(which likely get
optimized out anyway).  I'm not sure what the stable policy is with
introduced compiler warnings, so I included the patch submission in
this discussion for completeness.

Thank you,
Dave Chiluk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-09-27  6:12   ` Dave Chiluk
@ 2019-09-27  6:24     ` Greg KH
  2019-09-27 13:13     ` Sasha Levin
  1 sibling, 0 replies; 20+ messages in thread
From: Greg KH @ 2019-09-27  6:24 UTC (permalink / raw)
  To: Dave Chiluk; +Cc: Ben Segall, Peter Zijlstra, Phil Auld, stable

On Fri, Sep 27, 2019 at 01:12:40AM -0500, Dave Chiluk wrote:
> On Wed, Sep 25, 2019 at 1:44 AM Greg KH <greg@kroah.com> wrote:
> >
> > On Wed, Sep 25, 2019 at 12:53:48AM -0500, Dave Chiluk wrote:
> > > Commit de53fd7aedb1 ("sched/fair: Fix low cpu usage with high
> > > throttling by removing expiration of cpu-local slices") fixes a major
> > > performance issue for containerized clouds such as Kubernetes.
> > >
> > > Commit de53fd7aedb1 Fixes commit : 512ac999d275 ("sched/fair: Fix
> > > bandwidth timer clock drift condition").
> > >
> > > This should be applied to all stable kernels that applied commit
> > > 512ac999d275, and should probably be applied to all others as well.
> >
> > As this commit isn't in a released kernel just yet, we should wait to
> > see what happens when it hits people's machines, right?
> 
> I think waiting till 5.4 is released would be irresponsible on this
> one.  512ac999 was recently pushed back into many of the distro
> kernels via the stable streams.  Additionally every major container
> running cloud provider (public and private), is likely hitting this
> whether they've noticed or not.  Before 512ac999 we would hit the
> issue that resolved once every 10000 application deployments or so *(I
> admit some providers appear to have been hit by that more often).
> However, the fix implemented by 512ac999 is guaranteed to affect 100%
> of cgroup cpu bandwidth constrained applications.
> 
> Our java applications are particularly hard hit as java tends to be
> very thread happy.   Doing some napkin math, given the roughly 9000
> applications we have running in our clouds we've had to over-allocate
> each one's CPU quota by roughly .5 cpu to account for this behavior
> change (worst case scenario is actually .01 cpu * cores in machine,
> but not all of our applications are affected equally).
> 9000 applications * .5 CPU = 4500 cores
> 4500 Cores / 88 cores per node = 51 additional machines required to
> satisfy the inflated quota requirements.
> Given each 88 core machine costs roughly $30k that equates to $1.5M
> that this issue is costing us.
> We have been able to alleviate this somewhat of this by turning on CPU
> overcommit on in the container scheduler. (Deploying more applications
> per host than the CPU would generally allow for).  This however leads
> to occasional overloaded machines, and regular high response
> times/wide response time distributions for applications.
> 
> If you have a java or golang application running on a containerized
> cloud anywhere chances are you are either paying extra to get the CPU
> you need or your application is being throttled before using the quota
> you've paid for.  I'm sure other languages could be affected, but
> these are the two we've seen that by default generate enough threads
> to regularly hit this issue.
> 
> All of these problems are leading the kubernetes community to change
> the defaulrs away from the cfs bandwidth scheduling mechanisms and
> towards other mechanisms or hackish workarounds. *(adjusting periods,
> turning off cfs bandwidth, moving to only soft limits).
> https://github.com/kubernetes/kubernetes/issues/67577
> https://github.com/kubernetes/kubernetes/issues/70585
> https://github.com/kubernetes/kubernetes/issues/51135
> 
> I appreciate your reluctance to accept this patch, but I strongly
> believe pulling this sooner than later is the right thing to do.

I'll defer to the scheduler developers/maintainers here as to when to
take this patch, it's their decision.

> > Also, always cc: all of the people involved in the patch you are asking
> > for, so as to get their opinion.
> Fixed.  I'll submit a change to stable-kernel-rules.rst upon my return
> from vacation.

Great!

> > For some reason this patch did not
> > have a cc: stable tag on it, was that because the developers did not
> > think it was relevant for stable kernels?
> 
> This was my fault, I was unaware I could simply cc: stable to have it
> auto-submitted.  I promise I'll be better in the future.

Even if this was "auto-submitted", it would not show up in a kernel
until it hit a release of Linus's tree (i.e. 5.4-rc1.)  So we would
still be having this discussion :)

> > > It may also be prudent to also apply the not yet accepted patch that
> > > fixes some introduced compiler warnings discussed here.
> > > https://lkml.org/lkml/2019/9/18/925
> >
> > So we should wait for this to hit Linus's tree too, right?
> 
> This patch contributes nothing materially to de53fd7aedb1, and only
> eliminates compiler warnings for unused variables *(which likely get
> optimized out anyway).  I'm not sure what the stable policy is with
> introduced compiler warnings, so I included the patch submission in
> this discussion for completeness.

Adding build warnings is a huge no-no as it has the ability to hide real
problems with backports.  Right now the stable trees build for me with
no, or only 1, warnings, depending on the branch.  That is something
that I want to see continue to happen for obvious reasons.  So including
a patch that we know adds warnings is not good.

It's also a huge hint that it didn't get much, if any, testing in
linux-next as the warning would have shown up there, and hopefully fixed
soon, as part of the original merge.  But I digress...

So, I'll be glad to merge this "now" if I get an ACK from the others.
Otherwise I'll have to wait until Monday.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-09-27  6:12   ` Dave Chiluk
  2019-09-27  6:24     ` Greg KH
@ 2019-09-27 13:13     ` Sasha Levin
  2019-10-03  5:15       ` Dave Chiluk
  1 sibling, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2019-09-27 13:13 UTC (permalink / raw)
  To: Dave Chiluk; +Cc: Greg KH, Ben Segall, Peter Zijlstra, Phil Auld, stable

On Fri, Sep 27, 2019 at 01:12:40AM -0500, Dave Chiluk wrote:
>On Wed, Sep 25, 2019 at 1:44 AM Greg KH <greg@kroah.com> wrote:
>>
>> On Wed, Sep 25, 2019 at 12:53:48AM -0500, Dave Chiluk wrote:
>> > Commit de53fd7aedb1 ("sched/fair: Fix low cpu usage with high
>> > throttling by removing expiration of cpu-local slices") fixes a major
>> > performance issue for containerized clouds such as Kubernetes.
>> >
>> > Commit de53fd7aedb1 Fixes commit : 512ac999d275 ("sched/fair: Fix
>> > bandwidth timer clock drift condition").
>> >
>> > This should be applied to all stable kernels that applied commit
>> > 512ac999d275, and should probably be applied to all others as well.
>>
>> As this commit isn't in a released kernel just yet, we should wait to
>> see what happens when it hits people's machines, right?
>
>I think waiting till 5.4 is released would be irresponsible on this
>one.  512ac999 was recently pushed back into many of the distro

The "released kernel" statement means 5.4-rc1 rather than 5.4, which is
just a few days away.

--
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-09-27 13:13     ` Sasha Levin
@ 2019-10-03  5:15       ` Dave Chiluk
  2019-10-03  6:51         ` Greg KH
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chiluk @ 2019-10-03  5:15 UTC (permalink / raw)
  To: Greg KH
  Cc: Ben Segall, Peter Zijlstra, Phil Auld, stable, Sasha Levin, Ingo Molnar

@Greg KH, Qian Cai's compiler warning fix has now been integrated into
Linus' tree as commit: 763a9ec06c409

Both de53fd7aedb1 and 763a9ec06c40 are now apart of v5.4-rc1.  Can you
please queue up these fixes for backport to all stable kernels.

Thank you,
Dave Chiluk

On Fri, Sep 27, 2019 at 8:13 AM Sasha Levin <sashal@kernel.org> wrote:
>
> On Fri, Sep 27, 2019 at 01:12:40AM -0500, Dave Chiluk wrote:
> >On Wed, Sep 25, 2019 at 1:44 AM Greg KH <greg@kroah.com> wrote:
> >>
> >> On Wed, Sep 25, 2019 at 12:53:48AM -0500, Dave Chiluk wrote:
> >> > Commit de53fd7aedb1 ("sched/fair: Fix low cpu usage with high
> >> > throttling by removing expiration of cpu-local slices") fixes a major
> >> > performance issue for containerized clouds such as Kubernetes.
> >> >
> >> > Commit de53fd7aedb1 Fixes commit : 512ac999d275 ("sched/fair: Fix
> >> > bandwidth timer clock drift condition").
> >> >
> >> > This should be applied to all stable kernels that applied commit
> >> > 512ac999d275, and should probably be applied to all others as well.
> >>
> >> As this commit isn't in a released kernel just yet, we should wait to
> >> see what happens when it hits people's machines, right?
> >
> >I think waiting till 5.4 is released would be irresponsible on this
> >one.  512ac999 was recently pushed back into many of the distro
>
> The "released kernel" statement means 5.4-rc1 rather than 5.4, which is
> just a few days away.
>
> --
> Thanks,
> Sasha

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-10-03  5:15       ` Dave Chiluk
@ 2019-10-03  6:51         ` Greg KH
  2019-10-18 20:23           ` Dave Chiluk
  0 siblings, 1 reply; 20+ messages in thread
From: Greg KH @ 2019-10-03  6:51 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Ben Segall, Peter Zijlstra, Phil Auld, stable, Sasha Levin, Ingo Molnar

On Thu, Oct 03, 2019 at 12:15:02AM -0500, Dave Chiluk wrote:
> @Greg KH, Qian Cai's compiler warning fix has now been integrated into
> Linus' tree as commit: 763a9ec06c409
> 
> Both de53fd7aedb1 and 763a9ec06c40 are now apart of v5.4-rc1.  Can you
> please queue up these fixes for backport to all stable kernels.

I need an ack from the scheduler maintainers that this is ok to do so...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-10-03  6:51         ` Greg KH
@ 2019-10-18 20:23           ` Dave Chiluk
  2019-10-18 20:53             ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chiluk @ 2019-10-18 20:23 UTC (permalink / raw)
  To: Greg KH
  Cc: Ben Segall, Peter Zijlstra, Phil Auld, stable, Sasha Levin, Ingo Molnar

@Ben @Ingo @Peter
Can you please please ack this backport request?

Thank you,
Dave Chiluk

On Thu, Oct 3, 2019 at 1:51 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Oct 03, 2019 at 12:15:02AM -0500, Dave Chiluk wrote:
> > @Greg KH, Qian Cai's compiler warning fix has now been integrated into
> > Linus' tree as commit: 763a9ec06c409
> >
> > Both de53fd7aedb1 and 763a9ec06c40 are now apart of v5.4-rc1.  Can you
> > please queue up these fixes for backport to all stable kernels.
>
> I need an ack from the scheduler maintainers that this is ok to do so...
>
> thanks,
>
> greg k-h

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-10-18 20:23           ` Dave Chiluk
@ 2019-10-18 20:53             ` Peter Zijlstra
  2019-10-22 14:55               ` Dave Chiluk
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2019-10-18 20:53 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Greg KH, Ben Segall, Phil Auld, stable, Sasha Levin, Ingo Molnar

On Fri, Oct 18, 2019 at 03:23:02PM -0500, Dave Chiluk wrote:
> @Ben @Ingo @Peter
> Can you please please ack this backport request?
> 
> Thank you,
> Dave Chiluk
> 
> On Thu, Oct 3, 2019 at 1:51 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, Oct 03, 2019 at 12:15:02AM -0500, Dave Chiluk wrote:
> > > @Greg KH, Qian Cai's compiler warning fix has now been integrated into
> > > Linus' tree as commit: 763a9ec06c409
> > >
> > > Both de53fd7aedb1 and 763a9ec06c40 are now apart of v5.4-rc1.  Can you
> > > please queue up these fixes for backport to all stable kernels.
> >
> > I need an ack from the scheduler maintainers that this is ok to do so...

Sure I suppose, but what makes this commit special? Don't you normally
take just about anything?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-10-18 20:53             ` Peter Zijlstra
@ 2019-10-22 14:55               ` Dave Chiluk
  2019-11-04 11:08                 ` Greg KH
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chiluk @ 2019-10-22 14:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Greg KH, Ben Segall, Phil Auld, stable, Sasha Levin, Ingo Molnar

On Fri, Oct 18, 2019 at 3:53 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Oct 18, 2019 at 03:23:02PM -0500, Dave Chiluk wrote:
> > @Ben @Ingo @Peter
> > Can you please please ack this backport request?
> >
> > Thank you,
> > Dave Chiluk
> >
> > On Thu, Oct 3, 2019 at 1:51 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Thu, Oct 03, 2019 at 12:15:02AM -0500, Dave Chiluk wrote:
> > > > @Greg KH, Qian Cai's compiler warning fix has now been integrated into
> > > > Linus' tree as commit: 763a9ec06c409
> > > >
> > > > Both de53fd7aedb1 and 763a9ec06c40 are now apart of v5.4-rc1.  Can you
> > > > please queue up these fixes for backport to all stable kernels.
> > >
> > > I need an ack from the scheduler maintainers that this is ok to do so...
>
> Sure I suppose, but what makes this commit special? Don't you normally
> take just about anything?

I think this is more a matter of me being a relatively unknown in the
scheduler space, and Greg is just being responsible as this looks like
a pretty scary fix.

In reality, I probably should have just added "Cc:
stable@vger.kernel.org" to the sign-off area of the initial commit and
this conversation wouldn't have been necessary.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-10-22 14:55               ` Dave Chiluk
@ 2019-11-04 11:08                 ` Greg KH
  2019-11-08 18:06                   ` [PATCH v4.14.y 1/2] " Dave Chiluk
                                     ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Greg KH @ 2019-11-04 11:08 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Peter Zijlstra, Ben Segall, Phil Auld, stable, Sasha Levin, Ingo Molnar

On Tue, Oct 22, 2019 at 09:55:58AM -0500, Dave Chiluk wrote:
> On Fri, Oct 18, 2019 at 3:53 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Oct 18, 2019 at 03:23:02PM -0500, Dave Chiluk wrote:
> > > @Ben @Ingo @Peter
> > > Can you please please ack this backport request?
> > >
> > > Thank you,
> > > Dave Chiluk
> > >
> > > On Thu, Oct 3, 2019 at 1:51 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Thu, Oct 03, 2019 at 12:15:02AM -0500, Dave Chiluk wrote:
> > > > > @Greg KH, Qian Cai's compiler warning fix has now been integrated into
> > > > > Linus' tree as commit: 763a9ec06c409
> > > > >
> > > > > Both de53fd7aedb1 and 763a9ec06c40 are now apart of v5.4-rc1.  Can you
> > > > > please queue up these fixes for backport to all stable kernels.
> > > >
> > > > I need an ack from the scheduler maintainers that this is ok to do so...
> >
> > Sure I suppose, but what makes this commit special? Don't you normally
> > take just about anything?
> 
> I think this is more a matter of me being a relatively unknown in the
> scheduler space, and Greg is just being responsible as this looks like
> a pretty scary fix.
> 
> In reality, I probably should have just added "Cc:
> stable@vger.kernel.org" to the sign-off area of the initial commit and
> this conversation wouldn't have been necessary.

That is very true :)

I just tried to apply this to 5.3, 4.19, and 4.14, and it only applies
cleanly to 5.3.y.  Can you please provide backports for 4.19.y and
4.14.y so that I can queue it up there?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4.14.y 1/2] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-11-04 11:08                 ` Greg KH
@ 2019-11-08 18:06                   ` Dave Chiluk
  2019-11-08 20:15                   ` [PATCH v4.14.y 0/2] Please backport de53fd7aedb1 : " Dave Chiluk
  2019-11-08 20:20                   ` [PATCH v4.19.y " Dave Chiluk
  2 siblings, 0 replies; 20+ messages in thread
From: Dave Chiluk @ 2019-11-08 18:06 UTC (permalink / raw)
  To: Greg KH
  Cc: Peter Zijlstra, Ben Segall, Phil Auld, stable, Sasha Levin,
	Ingo Molnar, Qian Cai

[ Upstream commit de53fd7aedb100f03e5d2231cfce0e4993282425 ]

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to cpu-local run queue being allocated per cpu
bandwidth slices, and then not fully using that slice within the period.
At which point the slice and quota expires. This expiration of unused
slice results in applications not being able to utilize the quota for
which they are allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueueu. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Hammond <jhammond@indeed.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kyle Anderson <kwa@yelp.com>
Cc: Gabriel Munos <gmunoz@netflix.com>
Cc: Peter Oskolkov <posk@posk.io>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
(cherry picked from commit 25d2df31fd426f34c18227c2cfe2d5e6d14a34fc)
---
 Documentation/scheduler/sched-bwc.txt | 45 ++++++++++++++++++++++
 kernel/sched/fair.c                   | 70 ++++-------------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 52 insertions(+), 67 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..de583fb 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -90,6 +90,51 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
 
+CFS Bandwidth Quota Caveats
+---------------------------
+Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
+the slice may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable. This is a performance tweak that helps prevent added contention on
+the global lock.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+For highly-threaded, non-cpu bound applications this non-expiration nuance
+allows applications to briefly burst past their quota limits by the amount of
+unused slice on each cpu that the task group is running on (typically at most
+1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
+applies if quota had been assigned to a cpu and then not fully used or returned
+in previous periods. This burst amount will not be transferred between cores.
+As a result, this mechanism still strictly limits the task group to quota
+average usage, albeit over a longer time window than a single period.  This
+also limits the burst ability to no more than 1ms per cpu.  This provides
+better more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultanously using less than
+quota amounts of cpu. Another way to say this, is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wastefully expiring quota on cpu-local silos that don't need a
+full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
 Examples
 --------
 1. Limit a group to 1 CPU worth of runtime.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 55a33009..d5c032e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4106,8 +4106,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4129,8 +4127,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4147,61 +4144,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4387,8 +4340,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4413,7 +4365,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4438,7 +4389,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4466,8 +4417,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4480,8 +4429,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock(&cfs_b->lock);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock(&cfs_b->lock);
 
 		cfs_b->distribute_running = 0;
@@ -4558,8 +4506,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4591,7 +4538,6 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock(&cfs_b->lock);
@@ -4608,7 +4554,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4617,11 +4562,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock(&cfs_b->lock);
-	if (expires == cfs_b->runtime_expires)
-		cfs_b->runtime -= min(runtime, cfs_b->runtime);
+	cfs_b->runtime -= min(runtime, cfs_b->runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock(&cfs_b->lock);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 452b569..268f560 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -280,8 +280,6 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchical_quota;
-	u64 runtime_expires;
-	int expires_seq;
 
 	short idle, period_active;
 	struct hrtimer period_timer, slack_timer;
@@ -489,8 +487,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;
-	int expires_seq;
-	u64 runtime_expires;
 	s64 runtime_remaining;
 
 	u64 throttled_clock, throttled_clock_task;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4.14.y 0/2] Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-11-04 11:08                 ` Greg KH
  2019-11-08 18:06                   ` [PATCH v4.14.y 1/2] " Dave Chiluk
@ 2019-11-08 20:15                   ` Dave Chiluk
  2019-11-08 20:15                     ` [PATCH v4.14.y 1/2] " Dave Chiluk
                                       ` (2 more replies)
  2019-11-08 20:20                   ` [PATCH v4.19.y " Dave Chiluk
  2 siblings, 3 replies; 20+ messages in thread
From: Dave Chiluk @ 2019-11-08 20:15 UTC (permalink / raw)
  To: Greg KH
  Cc: Peter Zijlstra, Ben Segall, Phil Auld, stable, Sasha Levin,
	Ingo Molnar, Qian Cai

Here's the backported patches for 4.14.y.  Logic is basically the same, the
issue was primarily with patch context. The patches are really back-ports and
not cherry-picks because of that, if that's an issue feel free to change the
text description.

[PATCH v4.14.y 1/2] sched/fair: Fix low cpu usage with high throttling by
[PATCH v4.14.y 2/2] sched/fair: Fix -Wunused-but-set-variable warnings


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4.14.y 1/2] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-11-08 20:15                   ` [PATCH v4.14.y 0/2] Please backport de53fd7aedb1 : " Dave Chiluk
@ 2019-11-08 20:15                     ` Dave Chiluk
  2019-11-08 20:15                     ` [PATCH v4.14.y 2/2] sched/fair: Fix -Wunused-but-set-variable warnings Dave Chiluk
  2019-11-11  9:18                     ` [PATCH v4.14.y 0/2] Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Greg KH
  2 siblings, 0 replies; 20+ messages in thread
From: Dave Chiluk @ 2019-11-08 20:15 UTC (permalink / raw)
  To: Greg KH
  Cc: Peter Zijlstra, Ben Segall, Phil Auld, stable, Sasha Levin,
	Ingo Molnar, Qian Cai

[ Upstream commit de53fd7aedb100f03e5d2231cfce0e4993282425 ]

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to cpu-local run queue being allocated per cpu
bandwidth slices, and then not fully using that slice within the period.
At which point the slice and quota expires. This expiration of unused
slice results in applications not being able to utilize the quota for
which they are allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueueu. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Hammond <jhammond@indeed.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kyle Anderson <kwa@yelp.com>
Cc: Gabriel Munos <gmunoz@netflix.com>
Cc: Peter Oskolkov <posk@posk.io>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
(cherry picked from commit 25d2df31fd426f34c18227c2cfe2d5e6d14a34fc)
---
 Documentation/scheduler/sched-bwc.txt | 45 ++++++++++++++++++++++
 kernel/sched/fair.c                   | 70 ++++-------------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 52 insertions(+), 67 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..de583fb 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -90,6 +90,51 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
 
+CFS Bandwidth Quota Caveats
+---------------------------
+Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
+the slice may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable. This is a performance tweak that helps prevent added contention on
+the global lock.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+For highly-threaded, non-cpu bound applications this non-expiration nuance
+allows applications to briefly burst past their quota limits by the amount of
+unused slice on each cpu that the task group is running on (typically at most
+1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
+applies if quota had been assigned to a cpu and then not fully used or returned
+in previous periods. This burst amount will not be transferred between cores.
+As a result, this mechanism still strictly limits the task group to quota
+average usage, albeit over a longer time window than a single period.  This
+also limits the burst ability to no more than 1ms per cpu.  This provides
+better more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultanously using less than
+quota amounts of cpu. Another way to say this, is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wastefully expiring quota on cpu-local silos that don't need a
+full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
 Examples
 --------
 1. Limit a group to 1 CPU worth of runtime.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 55a33009..d5c032e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4106,8 +4106,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4129,8 +4127,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4147,61 +4144,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4387,8 +4340,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4413,7 +4365,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4438,7 +4389,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4466,8 +4417,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4480,8 +4429,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock(&cfs_b->lock);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock(&cfs_b->lock);
 
 		cfs_b->distribute_running = 0;
@@ -4558,8 +4506,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4591,7 +4538,6 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock(&cfs_b->lock);
@@ -4608,7 +4554,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4617,11 +4562,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock(&cfs_b->lock);
-	if (expires == cfs_b->runtime_expires)
-		cfs_b->runtime -= min(runtime, cfs_b->runtime);
+	cfs_b->runtime -= min(runtime, cfs_b->runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock(&cfs_b->lock);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 452b569..268f560 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -280,8 +280,6 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchical_quota;
-	u64 runtime_expires;
-	int expires_seq;
 
 	short idle, period_active;
 	struct hrtimer period_timer, slack_timer;
@@ -489,8 +487,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;
-	int expires_seq;
-	u64 runtime_expires;
 	s64 runtime_remaining;
 
 	u64 throttled_clock, throttled_clock_task;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4.14.y 2/2] sched/fair: Fix -Wunused-but-set-variable warnings
  2019-11-08 20:15                   ` [PATCH v4.14.y 0/2] Please backport de53fd7aedb1 : " Dave Chiluk
  2019-11-08 20:15                     ` [PATCH v4.14.y 1/2] " Dave Chiluk
@ 2019-11-08 20:15                     ` Dave Chiluk
  2019-11-11  9:18                     ` [PATCH v4.14.y 0/2] Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Greg KH
  2 siblings, 0 replies; 20+ messages in thread
From: Dave Chiluk @ 2019-11-08 20:15 UTC (permalink / raw)
  To: Greg KH
  Cc: Peter Zijlstra, Ben Segall, Phil Auld, stable, Sasha Levin,
	Ingo Molnar, Qian Cai

From: Qian Cai <cai@lca.pw>

[ Upstream commit 763a9ec06c409dcde2a761aac4bb83ff3938e0b3 ]

Commit:

   de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")

introduced a few compilation warnings:

  kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
  kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
  kernel/sched/fair.c: In function 'start_cfs_bandwidth':
  kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]

Also, __refill_cfs_bandwidth_runtime() does no longer update the
expiration time, so fix the comments accordingly.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Dave Chiluk <chiluk+linux@indeed.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pauld@redhat.com
Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d5c032e..feeb528 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4091,21 +4091,16 @@ static inline u64 sched_cfs_bandwidth_slice(void)
 }
 
 /*
- * Replenish runtime according to assigned quota and update expiration time.
- * We use sched_clock_cpu directly instead of rq->clock to avoid adding
- * additional synchronization around rq->lock.
+ * Replenish runtime according to assigned quota. We use sched_clock_cpu
+ * directly instead of rq->clock to avoid adding additional synchronization
+ * around rq->lock.
  *
  * requires cfs_b->lock
  */
 void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 {
-	u64 now;
-
-	if (cfs_b->quota == RUNTIME_INF)
-		return;
-
-	now = sched_clock_cpu(smp_processor_id());
-	cfs_b->runtime = cfs_b->quota;
+	if (cfs_b->quota != RUNTIME_INF)
+		cfs_b->runtime = cfs_b->quota;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4.19.y 0/2] Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-11-04 11:08                 ` Greg KH
  2019-11-08 18:06                   ` [PATCH v4.14.y 1/2] " Dave Chiluk
  2019-11-08 20:15                   ` [PATCH v4.14.y 0/2] Please backport de53fd7aedb1 : " Dave Chiluk
@ 2019-11-08 20:20                   ` Dave Chiluk
  2019-11-08 20:20                     ` [PATCH v4.19.y 1/2] " Dave Chiluk
                                       ` (2 more replies)
  2 siblings, 3 replies; 20+ messages in thread
From: Dave Chiluk @ 2019-11-08 20:20 UTC (permalink / raw)
  To: Greg KH
  Cc: Peter Zijlstra, Ben Segall, Phil Auld, stable, Sasha Levin,
	Ingo Molnar, Qian Cai

Here's the backported patches for 4.19.y.  Logic is basically the same, the
issue was primarily with patch context. The patches are really back-ports and
not cherry-picks because of that, if that's an issue feel free to change the
text description.

[PATCH v4.19.y 1/2] sched/fair: Fix low cpu usage with high throttling by
[PATCH v4.19.y 2/2] sched/fair: Fix -Wunused-but-set-variable warnings

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4.19.y 1/2] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-11-08 20:20                   ` [PATCH v4.19.y " Dave Chiluk
@ 2019-11-08 20:20                     ` Dave Chiluk
  2019-11-08 20:20                     ` [PATCH v4.19.y 2/2] sched/fair: Fix -Wunused-but-set-variable warnings Dave Chiluk
  2019-11-11  9:19                     ` [PATCH v4.19.y 0/2] Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Greg KH
  2 siblings, 0 replies; 20+ messages in thread
From: Dave Chiluk @ 2019-11-08 20:20 UTC (permalink / raw)
  To: Greg KH
  Cc: Peter Zijlstra, Ben Segall, Phil Auld, stable, Sasha Levin,
	Ingo Molnar, Qian Cai

[ Upstream commit de53fd7aedb100f03e5d2231cfce0e4993282425 ]

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to cpu-local run queue being allocated per cpu
bandwidth slices, and then not fully using that slice within the period.
At which point the slice and quota expires. This expiration of unused
slice results in applications not being able to utilize the quota for
which they are allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueueu. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Hammond <jhammond@indeed.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kyle Anderson <kwa@yelp.com>
Cc: Gabriel Munos <gmunoz@netflix.com>
Cc: Peter Oskolkov <posk@posk.io>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
---
 Documentation/scheduler/sched-bwc.txt | 45 ++++++++++++++++++++++
 kernel/sched/fair.c                   | 72 ++++-------------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 52 insertions(+), 69 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..de583fb 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -90,6 +90,51 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
 
+CFS Bandwidth Quota Caveats
+---------------------------
+Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
+the slice may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable. This is a performance tweak that helps prevent added contention on
+the global lock.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+For highly-threaded, non-cpu bound applications this non-expiration nuance
+allows applications to briefly burst past their quota limits by the amount of
+unused slice on each cpu that the task group is running on (typically at most
+1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
+applies if quota had been assigned to a cpu and then not fully used or returned
+in previous periods. This burst amount will not be transferred between cores.
+As a result, this mechanism still strictly limits the task group to quota
+average usage, albeit over a longer time window than a single period.  This
+also limits the burst ability to no more than 1ms per cpu.  This provides
+better more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultanously using less than
+quota amounts of cpu. Another way to say this, is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wastefully expiring quota on cpu-local silos that don't need a
+full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
 Examples
 --------
 1. Limit a group to 1 CPU worth of runtime.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 32d2dac..cf0f476 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4320,8 +4320,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4343,8 +4341,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4361,61 +4358,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4600,8 +4553,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4626,7 +4578,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4651,7 +4602,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4679,8 +4630,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4693,8 +4642,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock(&cfs_b->lock);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock(&cfs_b->lock);
 
 		cfs_b->distribute_running = 0;
@@ -4771,8 +4719,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4804,7 +4751,6 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock(&cfs_b->lock);
@@ -4821,7 +4767,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4830,11 +4775,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock(&cfs_b->lock);
-	if (expires == cfs_b->runtime_expires)
-		cfs_b->runtime -= min(runtime, cfs_b->runtime);
+	cfs_b->runtime -= min(runtime, cfs_b->runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock(&cfs_b->lock);
 }
@@ -4989,8 +4933,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9a7c3d0..62058fd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -334,8 +334,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;
 
 	short			idle;
 	short			period_active;
@@ -555,8 +553,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;
 
 	u64			throttled_clock;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4.19.y 2/2] sched/fair: Fix -Wunused-but-set-variable warnings
  2019-11-08 20:20                   ` [PATCH v4.19.y " Dave Chiluk
  2019-11-08 20:20                     ` [PATCH v4.19.y 1/2] " Dave Chiluk
@ 2019-11-08 20:20                     ` Dave Chiluk
  2019-11-11  9:19                     ` [PATCH v4.19.y 0/2] Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Greg KH
  2 siblings, 0 replies; 20+ messages in thread
From: Dave Chiluk @ 2019-11-08 20:20 UTC (permalink / raw)
  To: Greg KH
  Cc: Peter Zijlstra, Ben Segall, Phil Auld, stable, Sasha Levin,
	Ingo Molnar, Qian Cai

From: Qian Cai <cai@lca.pw>

[ Upstream commit 763a9ec06c409dcde2a761aac4bb83ff3938e0b3 ]

Commit:

   de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")

introduced a few compilation warnings:

  kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
  kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
  kernel/sched/fair.c: In function 'start_cfs_bandwidth':
  kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]

Also, __refill_cfs_bandwidth_runtime() does no longer update the
expiration time, so fix the comments accordingly.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Dave Chiluk <chiluk+linux@indeed.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pauld@redhat.com
Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cf0f476..e5e8f67 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4305,21 +4305,16 @@ static inline u64 sched_cfs_bandwidth_slice(void)
 }
 
 /*
- * Replenish runtime according to assigned quota and update expiration time.
- * We use sched_clock_cpu directly instead of rq->clock to avoid adding
- * additional synchronization around rq->lock.
+ * Replenish runtime according to assigned quota. We use sched_clock_cpu
+ * directly instead of rq->clock to avoid adding additional synchronization
+ * around rq->lock.
  *
  * requires cfs_b->lock
  */
 void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 {
-	u64 now;
-
-	if (cfs_b->quota == RUNTIME_INF)
-		return;
-
-	now = sched_clock_cpu(smp_processor_id());
-	cfs_b->runtime = cfs_b->quota;
+	if (cfs_b->quota != RUNTIME_INF)
+		cfs_b->runtime = cfs_b->quota;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4924,15 +4919,13 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
-	u64 overrun;
-
 	lockdep_assert_held(&cfs_b->lock);
 
 	if (cfs_b->period_active)
 		return;
 
 	cfs_b->period_active = 1;
-	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v4.14.y 0/2] Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-11-08 20:15                   ` [PATCH v4.14.y 0/2] Please backport de53fd7aedb1 : " Dave Chiluk
  2019-11-08 20:15                     ` [PATCH v4.14.y 1/2] " Dave Chiluk
  2019-11-08 20:15                     ` [PATCH v4.14.y 2/2] sched/fair: Fix -Wunused-but-set-variable warnings Dave Chiluk
@ 2019-11-11  9:18                     ` Greg KH
  2 siblings, 0 replies; 20+ messages in thread
From: Greg KH @ 2019-11-11  9:18 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Peter Zijlstra, Ben Segall, Phil Auld, stable, Sasha Levin,
	Ingo Molnar, Qian Cai

On Fri, Nov 08, 2019 at 02:15:55PM -0600, Dave Chiluk wrote:
> Here's the backported patches for 4.14.y.  Logic is basically the same, the
> issue was primarily with patch context. The patches are really back-ports and
> not cherry-picks because of that, if that's an issue feel free to change the
> text description.
> 
> [PATCH v4.14.y 1/2] sched/fair: Fix low cpu usage with high throttling by
> [PATCH v4.14.y 2/2] sched/fair: Fix -Wunused-but-set-variable warnings
> 

Both now applied, thanks!

greg k-h

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4.19.y 0/2] Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-11-08 20:20                   ` [PATCH v4.19.y " Dave Chiluk
  2019-11-08 20:20                     ` [PATCH v4.19.y 1/2] " Dave Chiluk
  2019-11-08 20:20                     ` [PATCH v4.19.y 2/2] sched/fair: Fix -Wunused-but-set-variable warnings Dave Chiluk
@ 2019-11-11  9:19                     ` Greg KH
  2 siblings, 0 replies; 20+ messages in thread
From: Greg KH @ 2019-11-11  9:19 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Peter Zijlstra, Ben Segall, Phil Auld, stable, Sasha Levin,
	Ingo Molnar, Qian Cai

On Fri, Nov 08, 2019 at 02:20:06PM -0600, Dave Chiluk wrote:
> Here's the backported patches for 4.19.y.  Logic is basically the same, the
> issue was primarily with patch context. The patches are really back-ports and
> not cherry-picks because of that, if that's an issue feel free to change the
> text description.
> 
> [PATCH v4.19.y 1/2] sched/fair: Fix low cpu usage with high throttling by
> [PATCH v4.19.y 2/2] sched/fair: Fix -Wunused-but-set-variable warnings

Both now queued up, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-11-11  9:19 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-25  5:53 Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk
2019-09-25  6:44 ` Greg KH
2019-09-27  6:12   ` Dave Chiluk
2019-09-27  6:24     ` Greg KH
2019-09-27 13:13     ` Sasha Levin
2019-10-03  5:15       ` Dave Chiluk
2019-10-03  6:51         ` Greg KH
2019-10-18 20:23           ` Dave Chiluk
2019-10-18 20:53             ` Peter Zijlstra
2019-10-22 14:55               ` Dave Chiluk
2019-11-04 11:08                 ` Greg KH
2019-11-08 18:06                   ` [PATCH v4.14.y 1/2] " Dave Chiluk
2019-11-08 20:15                   ` [PATCH v4.14.y 0/2] Please backport de53fd7aedb1 : " Dave Chiluk
2019-11-08 20:15                     ` [PATCH v4.14.y 1/2] " Dave Chiluk
2019-11-08 20:15                     ` [PATCH v4.14.y 2/2] sched/fair: Fix -Wunused-but-set-variable warnings Dave Chiluk
2019-11-11  9:18                     ` [PATCH v4.14.y 0/2] Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Greg KH
2019-11-08 20:20                   ` [PATCH v4.19.y " Dave Chiluk
2019-11-08 20:20                     ` [PATCH v4.19.y 1/2] " Dave Chiluk
2019-11-08 20:20                     ` [PATCH v4.19.y 2/2] sched/fair: Fix -Wunused-but-set-variable warnings Dave Chiluk
2019-11-11  9:19                     ` [PATCH v4.19.y 0/2] Please backport de53fd7aedb1 : sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).