linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Peter Oskolkov <posk@posk.io>
To: Dave Chiluk <chiluk+linux@indeed.com>
Cc: Phil Auld <pauld@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	cgroups@vger.kernel.org,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Brendan Gregg <bgregg@netflix.com>, Kyle Anderson <kwa@yelp.com>,
	Gabriel Munos <gmunoz@netflix.com>,
	John Hammond <jhammond@indeed.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-doc@vger.kernel.org, Ben Segall <bsegall@google.com>
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
Date: Fri, 24 May 2019 09:28:30 -0700	[thread overview]
Message-ID: <CAFTs51Vm258CkDXi_Jj_cGOMotTvhdYR_VW8aUwAUvgistZOFQ@mail.gmail.com> (raw)
In-Reply-To: <CAC=E7cXxsyMLw1PR+8QchTH8FYL7WX6_8LBVdqueR1yjW+VVkQ@mail.gmail.com>

On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <chiluk+linux@indeed.com> wrote:
>
> On Fri, May 24, 2019 at 9:32 AM Phil Auld <pauld@redhat.com> wrote:
> > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
>
> > > If the machine runs at/close to capacity, won't the overallocation
> > > of the quota to bursty tasks necessarily negatively impact every other
> > > task? Should the "unused" quota be available only on idle CPUs?
> > > (Or maybe this is the behavior achieved here, and only the comment and
> > > the commit message should be fixed...)
> > >
> >
> > It's bounded by the amount left unused from the previous period. So
> > theoretically a process could use almost twice its quota. But then it
> > would have nothing left over in the next period. To repeat it would have
> > to not use any that next period. Over a longer number of periods it's the
> > same amount of CPU usage.
> >
> > I think that is more fair than throttling a process that has never used
> > its full quota.
> >
> > And it removes complexity.
> >
> > Cheers,
> > Phil
>
> Actually it's not even that bad.  The overallocation of quota to a
> bursty task in a period is limited to at most one slice per cpu, and
> that slice must not have been used in the previous periods.  The slice
> size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> defaults to 5ms.  If a bursty task goes from underutilizing quota to
> using it's entire quota, it will not be able to burst in the
> subsequent periods.  Therefore in an absolute worst case contrived
> scenario, a bursty task can add at most 5ms to the latency of other
> threads on the same CPU.  I think this worst case 5ms tradeoff is
> entirely worth it.
>
> This does mean that a theoretically a poorly written massively
> threaded application on an 80 core box, that spreads itself onto 80
> cpu run queues, can overutilize it's quota in a period by at most 5ms
> * 80 CPUs in a sincle period (slice * number of runqueues the
> application is running on).  But that means that each of those threads
>  would have had to not be use their quota in a previous period, and it
> also means that the application would have to be carefully written to
> exacerbate this behavior.
>
> Additionally if cpu bound threads underutilize a slice of their quota
> in a period due to the cfs choosing a bursty task to run, they should
> theoretically be able to make it up in the following periods when the
> bursty task is unable to "burst".

OK, so it is indeed possible that CPU bound threads will underutilize a slice
of their quota in a period as a result of this patch. This should probably
be clearly stated in the code comments and in the commit message.

In addition, I believe that although many workloads will indeed be
indifferent to getting their fair share "later", some latency-sensitive
workloads will definitely be negatively affected by this temporary
CPU quota stealing by bursty antagonists. So there should probably be
a way to limit this behavior; for example, by making it tunable
per cgroup.

>
> Please be careful here quota and slice are being treated differently.
> Quota does not roll-over between periods, only slices of quota that
> has already been allocated to per cpu run queues. If you allocate
> 100ms of quota per period to an application, but it only spreads onto
> 3 cpu run queues that means it can in the worst case use 3 x slice
> size = 15ms in periods following underutilization.
>
> So why does this matter.  Well applications that use thread pools
> *(*cough* java *cough*) with lots of tiny little worker threads, tend
> to spread themselves out onto a lot of run queues.  These worker
> threads grab quota slices in order to run, then rarely use all of
> their slice (1 or 2ms out of the 5ms).  This results in those worker
> threads starving the main application of quota, and then expiring the
> remainder of that quota slice on the per-cpu.  Going back to my
> earlier 100ms quota / 80 cpu example.  That means only
> 100ms/cfs_bandwidth_slice_us(5ms) = 20 slices are available in a
> period.  So only 20 out of these 80 cpus ever get a slice allocated to
> them.  By allowing these per-cpu run queues to use their remaining
> slice in following periods these worker threads do not need to be
> allocated additional slice, and thereby the main threads are actually
> able to use the allocated cpu quota.
>
> This can be experienced by running fibtest available at
> https://github.com/indeedeng/fibtest/.
> $ runfibtest 1
> runs a single fast thread taskset to cpu 0
> $ runfibtest 8
> Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
> to cpus 1-7.  This run is expected to show less iterations, but the
> worse problem is that the cpu usage is far less than the 500ms that it
> should have received.
>
> Thanks for the engagement on this,
> Dave Chiluk

  parent reply	other threads:[~2019-05-24 16:28 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-17 19:30 [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices Dave Chiluk
2019-05-23 18:44 ` [PATCH v2 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk
2019-05-23 18:44   ` [PATCH v2 1/1] " Dave Chiluk
2019-05-23 21:01     ` Peter Oskolkov
2019-05-24 14:32       ` Phil Auld
2019-05-24 15:14         ` Dave Chiluk
2019-05-24 15:59           ` Phil Auld
2019-05-24 16:28           ` Peter Oskolkov [this message]
2019-05-24 21:35             ` Dave Chiluk
2019-05-24 22:07               ` Peter Oskolkov
2019-05-28 22:25                 ` Dave Chiluk
2019-05-24  8:55     ` Peter Zijlstra
2019-05-29 19:08 ` [PATCH v3 0/1] " Dave Chiluk
2019-05-29 19:08   ` [PATCH v3 1/1] " Dave Chiluk
2019-05-29 19:28     ` Phil Auld
2019-05-29 19:50     ` bsegall
2019-05-29 21:05     ` bsegall
2019-05-30 17:53       ` Dave Chiluk
2019-05-30 20:44         ` bsegall
     [not found] ` <1561391404-14450-1-git-send-email-chiluk+linux@indeed.com>
2019-06-24 15:50   ` [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining Dave Chiluk
2019-06-24 17:33     ` bsegall
2019-06-26 22:10       ` Dave Chiluk
2019-06-27 20:18         ` bsegall
2019-06-27 19:09 ` [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk
2019-06-27 19:49 ` [PATCH v5 0/1] " Dave Chiluk
2019-06-27 19:49   ` [PATCH v5 1/1] " Dave Chiluk
2019-07-01 20:15     ` bsegall
2019-07-11  9:51       ` Peter Zijlstra
2019-07-11 17:46         ` bsegall
     [not found]           ` <CAC=E7cV4sO50NpYOZ06n_BkZTcBqf1KQp83prc+oave3ircBrw@mail.gmail.com>
2019-07-12 18:01             ` bsegall
2019-07-12 22:09             ` bsegall
2019-07-15 15:44               ` Dave Chiluk
2019-07-16 19:58     ` bsegall
2019-07-23 16:44 ` [PATCH v6 0/1] " Dave Chiluk
2019-07-23 16:44   ` [PATCH v6 1/1] " Dave Chiluk
2019-07-23 17:13     ` Phil Auld
2019-07-23 22:12       ` Dave Chiluk
2019-07-23 23:26         ` Phil Auld
2019-07-26 18:14       ` Peter Zijlstra
2019-08-08 10:53     ` [tip:sched/core] " tip-bot for Dave Chiluk

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAFTs51Vm258CkDXi_Jj_cGOMotTvhdYR_VW8aUwAUvgistZOFQ@mail.gmail.com \
    --to=posk@posk.io \
    --cc=bgregg@netflix.com \
    --cc=bsegall@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=chiluk+linux@indeed.com \
    --cc=corbet@lwn.net \
    --cc=gmunoz@netflix.com \
    --cc=jhammond@indeed.com \
    --cc=kwa@yelp.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=pauld@redhat.com \
    --cc=peterz@infradead.org \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).