From: Dave Chiluk <chiluk+linux@indeed.com>
To: Phil Auld <pauld@redhat.com>
Cc: Peter Oskolkov <posk@posk.io>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	cgroups@vger.kernel.org,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Brendan Gregg <bgregg@netflix.com>, Kyle Anderson <kwa@yelp.com>,
	Gabriel Munos <gmunoz@netflix.com>,
	John Hammond <jhammond@indeed.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-doc@vger.kernel.org, Ben Segall <bsegall@google.com>
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
Date: Fri, 24 May 2019 10:14:36 -0500	[thread overview]
Message-ID: <CAC=E7cXxsyMLw1PR+8QchTH8FYL7WX6_8LBVdqueR1yjW+VVkQ@mail.gmail.com> (raw)
In-Reply-To: <20190524143204.GB4684@lorien.usersys.redhat.com>

On Fri, May 24, 2019 at 9:32 AM Phil Auld <pauld@redhat.com> wrote:
> On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:

> > If the machine runs at/close to capacity, won't the overallocation
> > of the quota to bursty tasks necessarily negatively impact every other
> > task? Should the "unused" quota be available only on idle CPUs?
> > (Or maybe this is the behavior achieved here, and only the comment and
> > the commit message should be fixed...)
> >
>
> It's bounded by the amount left unused from the previous period. So
> theoretically a process could use almost twice its quota. But then it
> would have nothing left over in the next period. To repeat it would have
> to not use any that next period. Over a longer number of periods it's the
> same amount of CPU usage.
>
> I think that is more fair than throttling a process that has never used
> its full quota.
>
> And it removes complexity.
>
> Cheers,
> Phil

Actually it's not even that bad.  The overallocation of quota to a
bursty task in a period is limited to at most one slice per cpu, and
that slice must not have been used in the previous period.  The slice
size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
defaults to 5ms.  If a bursty task goes from underutilizing quota to
using its entire quota, it will not be able to burst in the
subsequent periods.  Therefore, in an absolute worst-case contrived
scenario, a bursty task can add at most 5ms to the latency of other
threads on the same CPU.  I think this worst-case 5ms tradeoff is
entirely worth it.

This does mean that, theoretically, a poorly written massively
threaded application on an 80-core box, one that spreads itself onto
80 cpu run queues, can overutilize its quota in a single period by at
most 5ms * 80 CPUs (slice size * number of run queues the application
is running on).  But that means each of those threads would have had
to not use their quota in the previous period, and it also means the
application would have to be carefully written to exacerbate this
behavior.

Additionally, if cpu-bound threads underutilize a slice of their quota
in a period because CFS chose a bursty task to run, they should
theoretically be able to make it up in the following periods, when the
bursty task is unable to "burst".

Please be careful here: quota and slices are treated differently.
Quota does not roll over between periods; only slices of quota that
have already been allocated to per-cpu run queues do.  If you allocate
100ms of quota per period to an application, but it only spreads onto
3 cpu run queues, that means in the worst case it can use 3 x slice
size = 15ms in periods following underutilization.
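As a back-of-the-envelope check, the worst-case carry-over bound works
out like this (plain Python sketch; the 3-run-queue and 80-cpu numbers
are just the examples from this thread, and the function name is mine,
not anything from the kernel):

```python
# Worst-case extra runtime a cgroup can use in one period under this
# patch: at most one leftover (previously unexpired) slice per cpu run
# queue the application is spread across.
SLICE_US = 5_000  # default /proc/sys/kernel/sched_cfs_bandwidth_slice_us

def worst_case_burst_us(nr_runqueues, slice_us=SLICE_US):
    """Upper bound on carried-over runtime: one leftover slice per rq."""
    return nr_runqueues * slice_us

# The 100ms-quota application above, spread onto only 3 run queues:
print(worst_case_burst_us(3))   # 15000 us == 15ms
# The contrived 80-core worst case:
print(worst_case_burst_us(80))  # 400000 us == 400ms
```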

So why does this matter?  Applications that use thread pools (*cough*
java *cough*) with lots of tiny little worker threads tend to spread
themselves out onto a lot of run queues.  These worker threads grab
quota slices in order to run, then rarely use all of their slice (1 or
2ms out of the 5ms).  This results in those worker threads starving
the main application of quota, and then expiring the remainder of that
quota slice on the per-cpu run queue.  Going back to my earlier 100ms
quota / 80 cpu example: only 100ms / cfs_bandwidth_slice_us (5ms) = 20
slices are available in a period, so only 20 out of those 80 cpus ever
get a slice allocated to them.  By allowing these per-cpu run queues
to use their remaining slice in following periods, these worker
threads do not need to be allocated additional slices, and the main
threads are actually able to use the allocated cpu quota.
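The slice-starvation arithmetic from that example, sketched in Python
(the 100ms quota and 80 cpus are just the numbers used above, not
defaults of anything):

```python
SLICE_US = 5_000    # default sched_cfs_bandwidth_slice_us
QUOTA_US = 100_000  # 100ms of quota per period
NR_CPUS = 80        # run queues the worker threads are spread across

# Each run queue that wants to run must grab at least one full slice
# from the global pool, so the number of run queues that can be served
# in a single period is bounded by:
slices_per_period = QUOTA_US // SLICE_US
print(slices_per_period)  # 20

# With workers scattered across 80 run queues, at most 20 of them get
# a slice in any given period; the rest (and the main thread) starve.
served = min(slices_per_period, NR_CPUS)
print(served, "of", NR_CPUS, "cpus get a slice")
```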

This can be experienced by running fibtest, available at
https://github.com/indeedeng/fibtest/.
$ runfibtest 1
runs a single fast thread taskset to cpu 0.
$ runfibtest 8
runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
to cpus 1-7.  This run is expected to show fewer iterations, but the
worse problem is that the cpu usage is far less than the 500ms it
should have received.

Thanks for the engagement on this,
Dave Chiluk

