From: Vincent Guittot <vincent.guittot@linaro.org>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Aubrey Li <aubrey.li@linux.intel.com>,
	Barry Song <song.bao.hua@hisilicon.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
Date: Wed, 22 Sep 2021 17:05:19 +0200
Message-ID: <CAKfTPtCEAVP88Y6szo84H6YeN7TXs6JSs8eKrToV-epL+Q3-=w@mail.gmail.com>
In-Reply-To: <CAKfTPtCxhzz1XgNXM8jaQC2=tGHm0ap88HneUgWTpCSeWVZwsw@mail.gmail.com>

On Wed, 22 Sept 2021 at 16:15, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Wed, 22 Sept 2021 at 15:20, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Wed, Sep 22, 2021 at 07:22:20AM +0200, Mike Galbraith wrote:
> > > On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:
> > > > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
> > > >
> > > >
> > > > > Preemption does rapidly run into diminishing return as load climbs for
> > > > > a lot of loads, but as you know, it's a rather sticky wicket because
> > > > > even when over-committed, preventing light control threads from slicing
> > > > > through (what can be a load's own work crew of) hogs can seriously
> > > > > injure performance.
> > > > >
> > > >
> > > > Turning this into a classic Rob Peter To Pay Paul problem. We don't know
> > > > if there is a light control thread that needs to run or not that affects
> > > > overall performance. It all depends on whether that control thread needs
> > > > to make progress for the overall workload or whether there is a mix of
> > > > workloads resulting in overloading.
> > >
> > > WRT overload, and our good buddies Peter and Paul :) I added...
> > >       if (gran >= sysctl_sched_latency >> 1)
> > >               trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running);
> > > ...to watch, and met the below when I.. logged in.
> > >
> > > homer:..debug/tracing # tail -20 trace
> > >                X-2229    [002] d..5.    60.690322: wakeup_gran: runnable:9 preempt disabled
> > >                X-2229    [002] d..5.    60.690325: wakeup_gran: runnable:10 preempt disabled
> > >                X-2229    [002] d..5.    60.690330: wakeup_gran: runnable:11 preempt disabled
> > >                X-2229    [002] d..5.    60.690363: wakeup_gran: runnable:13 preempt disabled
> > >                X-2229    [002] d..5.    60.690377: wakeup_gran: runnable:14 preempt disabled
> > >                X-2229    [002] d..5.    60.690390: wakeup_gran: runnable:15 preempt disabled
> > >                X-2229    [002] d..5.    60.690404: wakeup_gran: runnable:16 preempt disabled
> > >                X-2229    [002] d..5.    60.690425: wakeup_gran: runnable:9 preempt disabled
> > >        ksmserver-2694    [003] d..3.    60.690432: wakeup_gran: runnable:6 preempt disabled
> > >        ksmserver-2694    [003] d..3.    60.690436: wakeup_gran: runnable:7 preempt disabled
> > >                X-2229    [002] d..5.    60.690451: wakeup_gran: runnable:6 preempt disabled
> > >                X-2229    [002] d..5.    60.690465: wakeup_gran: runnable:7 preempt disabled
> > >             kmix-2736    [000] d..3.    60.690491: wakeup_gran: runnable:6 preempt disabled
> > >                X-2229    [004] d..5.    92.889635: wakeup_gran: runnable:6 preempt disabled
> > >                X-2229    [004] d..5.    92.889675: wakeup_gran: runnable:6 preempt disabled
> > >                X-2229    [004] d..5.    92.889863: wakeup_gran: runnable:6 preempt disabled
> > >                X-2229    [004] d..5.    92.889944: wakeup_gran: runnable:6 preempt disabled
> > >                X-2229    [004] d..5.    92.889957: wakeup_gran: runnable:7 preempt disabled
> > >                X-2229    [004] d..5.    92.889968: wakeup_gran: runnable:8 preempt disabled
> > >   QXcbEventQueue-2740    [000] d..4.    92.890025: wakeup_gran: runnable:6 preempt disabled
> > > homer:..debug/tracing
> > >
> > > Watching 'while sleep 1; do clear;tail trace; done' with nothing but a
> > > kbuild running is like watching top.  There's enough stacking during
> > > routine use of my desktop box that it runs into the tick granularity
> > > wall pretty much continuously, so 'overload' may want redefining.
> > >
> >
> > Ok, that's pretty convincing. You didn't mention if there were
> > interactivity glitches but it's possible. This is what I'm currently
> > testing but have no results for yet. It caps wakeup_gran at
> > sysctl_sched_latency.
> >
> > ---8<---
> > sched/fair: Scale wakeup granularity relative to nr_running
> >
> > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
> > the kernel.sched_wakeup_granularity_ns sysctl under debugfs.  One reason
> > this sysctl may be used is "optimising for throughput", particularly
> > when overloaded. The tool TuneD alters this for two profiles, e.g. "mssql"
> > and "throughput-performance". At least version 2.9 does; in master this
> > changed, and TuneD will poke at debugfs instead.
> >
> > Internal parameters like sysctl_sched_wakeup_granularity are scaled
> > based on the number of CPUs due to sysctl_sched_tunable_scaling. For
> > simplicity, the timing figures in this changelog are based on
> > SCHED_TUNABLESCALING_NONE.
>
> This is a bit misleading because the platform that you used to
> highlight the problem has a 7ms sysctl_sched_wakeup_granularity,
> which is far more than your tick, which should be 1ms.
>
> >
> > During task migration or wakeup, a decision is made on whether
> > to preempt the current task or not. To limit over-scheduling,
> > sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
> > of runtime before preempting. However, when a domain is heavily overloaded
> > (e.g. hackbench), the degree of over-scheduling is still severe. This is
> > problematic as time is wasted rescheduling tasks that could instead be
> > used by userspace tasks.
> >
> > However, care must be taken. Even if a system is overloaded, there may
> > be high priority threads that must still be able to run. Mike Galbraith
> > explained the constraints as follows:
> >
> >         CFS came about because the O1 scheduler was unfair to the
> >         point it had starvation problems. People pretty much across the
> >         board agreed that a fair scheduler was a much better way
> >         to go, and CFS was born.  It didn't originally have the sleep
> >         credit business, but had to grow it to become _short term_ fair.
> >         Ingo cut the sleep credit in half because of overscheduling, and
> >         that has worked out pretty well all told.. but now you're pushing
> >         it more in the unfair direction, all the way to extremely unfair
> >         for anything and everything very light.
> >
> >         Fairness isn't the holy grail mind you, and at some point, giving
> >         up on short term fairness certainly isn't crazy, as proven by your
> >         hackbench numbers and other numbers we've seen over the years,
> >         but taking bites out of the 'CF' in the CFS that was born to be a
> >         corner-case killer is.. worrisome.  The other shoe will drop.. it
> >         always does :)
> >
> > This patch scales the wakeup granularity based on the number of running
> > tasks on the CPU relative to
> >
> >         sched_nr_disable_gran = sysctl_sched_latency / sysctl_sched_wakeup_granularity
> >
> > By default, this will allow the wakeup_gran to scale from
> > sysctl_sched_wakeup_granularity up to sysctl_sched_latency,
> > depending on the number of running tasks on a cfs_rq. With the
> > default tunables, the limit is 6ms.
> >
> > Note that the TuneD throughput-performance profile allows up to 15ms
> > for sysctl_sched_latency (ignoring scaling) but there is no explanation
> > why such a long period was necessary or why sched_latency_ns is also
> > not adjusted. The intent may have been to disable wakeup preemption
> > or it might be an oversight.  An internet search for instances where
> > the sysctl_sched_wakeup_granularity parameter is tuned to high values
> > offers either no explanation or a broken one.
> >
> > TODO: Results positive or negative
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  kernel/sched/fair.c     | 64 ++++++++++++++++++++++++++++++++++++++++++-------
> >  kernel/sched/features.h |  6 +++++
> >  2 files changed, 61 insertions(+), 9 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ff69f245b939..5ec3b12039d6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -84,6 +84,14 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity       = 1000000UL;
> >
> >  const_debug unsigned int sysctl_sched_migration_cost   = 500000UL;
> >
> > +/*
> > + * This value is kept at sysctl_sched_latency / sysctl_sched_wakeup_granularity
> > + *
> > + * This influences the decision on whether a waking task can preempt a running
> > + * task.
> > + */
> > +static unsigned int sched_nr_disable_gran = 6;
> > +
> >  int sched_thermal_decay_shift;
> >  static int __init setup_sched_thermal_decay_shift(char *str)
> >  {
> > @@ -627,6 +635,9 @@ int sched_update_scaling(void)
> >         sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
> >                                         sysctl_sched_min_granularity);
> >
> > +       sched_nr_disable_gran = DIV_ROUND_UP(sysctl_sched_latency,
> > +                                       sysctl_sched_wakeup_granularity);
> > +
> >  #define WRT_SYSCTL(name) \
> >         (normalized_sysctl_##name = sysctl_##name / (factor))
> >         WRT_SYSCTL(sched_min_granularity);
> > @@ -4511,7 +4522,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  }
> >
> >  static int
> > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
> > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> > +                                               struct sched_entity *se);
> >
> >  /*
> >   * Pick the next process, keeping these things in mind, in this order:
> > @@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >                                 second = curr;
> >                 }
> >
> > -               if (second && wakeup_preempt_entity(second, left) < 1)
> > +               if (second && wakeup_preempt_entity(NULL, second, left) < 1)
>
> Why not apply the same policy here? The tick can also prevent the
> current task from moving forward.
>
> >                         se = second;
> >         }
> >
> > -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > +       if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
> >                 /*
> >                  * Someone really wants this to run. If it's not unfair, run it.
> >                  */
> >                 se = cfs_rq->next;
> > -       } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> > +       } else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
> >                 /*
> >                  * Prefer last buddy, try to return the CPU to a preempted task.
> >                  */
> > @@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  }
> >  #endif /* CONFIG_SMP */
> >
> > -static unsigned long wakeup_gran(struct sched_entity *se)
> > +static unsigned long
> > +select_wakeup_gran(struct cfs_rq *cfs_rq)
> > +{
> > +       unsigned int nr_running, threshold;
> > +
> > +       if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN))
> > +               return sysctl_sched_wakeup_granularity;
> > +
> > +       /* !GENTLE_FAIR_SLEEPERS has one overload threshold. */
> > +       if (!sched_feat(GENTLE_FAIR_SLEEPERS)) {
> > +               if (cfs_rq->nr_running <= sched_nr_disable_gran)
>
> cfs_rq->nr_running reflects the number of sched entities in the cfs_rq
> but not the number of running tasks, which is reflected in h_nr_running.
>
> Also, do you want to take into account only tasks in this cfs_rq and
> its children, or all cfs_rqs on this rq?
>
> > +                       return sysctl_sched_wakeup_granularity;
> > +
> > +               return sysctl_sched_latency;
> > +       }
> > +
> > +       /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */
> > +       nr_running = cfs_rq->nr_running;
> > +       threshold = sched_nr_disable_gran >> 1;
> > +
> > +       /* No overload. */
> > +       if (nr_running <= threshold)
> > +               return sysctl_sched_wakeup_granularity;
>
> TBH I don't like these "no overload", "light overload", ... levels.
> They don't have any real meaning apart from happening to work for
> your platform and your hackbench test.
> We have already had people complaining that a small cfs task does not
> preempt the curr task fast enough, as an example.
>
> There is no explanation why these values are the correct ones; they
> are just heuristic results. We are trying to remove platform
> heuristics, not add new ones.
>
> > +
> > +       /* Light overload. */
> > +       if (nr_running <= sched_nr_disable_gran)
> > +               return sysctl_sched_latency >> 1;
> > +
> > +       /* Heavy overload. */
> > +       return sysctl_sched_latency;
>
> Why should a thread without any relationship to curr not preempt it
> just because there are already a lot of running threads? In your
> case, you want hackbench threads to not preempt each other because
> they try to use the same resources, so it's probably better to let
> the current one move forward, but that's not a universal policy.
>
> Side question: have you tried changing the nice priority, which also
> impacts whether a thread can preempt curr?

There were also some discussions about adding a nice_latency prio so a
task/group can be set if it wants to be nice to others from a
scheduling latency pov. One open point was to define what it means to
be nice from a latency pov.
>
> > +}
> > +
> > +static unsigned long
> > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > -       unsigned long gran = sysctl_sched_wakeup_granularity;
> > +       unsigned long gran = select_wakeup_gran(cfs_rq);
> >
> >         /*
> >          * Since its curr running now, convert the gran from real-time
> > @@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
> >   *
> >   */
> >  static int
> > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> > +                                               struct sched_entity *se)
> >  {
> >         s64 gran, vdiff = curr->vruntime - se->vruntime;
> >
> >         if (vdiff <= 0)
> >                 return -1;
> >
> > -       gran = wakeup_gran(se);
> > +       gran = wakeup_gran(cfs_rq, se);
> >         if (vdiff > gran)
> >                 return 1;
> >
> > @@ -7191,7 +7237,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> >                 return;
> >
> >         update_curr(cfs_rq_of(se));
> > -       if (wakeup_preempt_entity(se, pse) == 1) {
> > +       if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {
>
> As for update_curr above, cfs_rq can be wrong because se could have changed.
>
> >                 /*
> >                  * Bias pick_next to pick the sched entity that is
> >                  * triggering this preemption.
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index 7f8dace0964c..d041d7023029 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -95,3 +95,9 @@ SCHED_FEAT(LATENCY_WARN, false)
> >
> >  SCHED_FEAT(ALT_PERIOD, true)
> >  SCHED_FEAT(BASE_SLICE, true)
> > +
> > +/*
> > + * Scale sched_wakeup_granularity dynamically based on the number of running
> > + * tasks up to a cap of sysctl_sched_latency.
> > + */
> > +SCHED_FEAT(SCALE_WAKEUP_GRAN, true)

Thread overview: 59+ messages
2021-09-20 14:26 [PATCH 0/2] Scale wakeup granularity relative to nr_running Mel Gorman
2021-09-20 14:26 ` [PATCH 1/2] sched/fair: Remove redundant lookup of rq in check_preempt_wakeup Mel Gorman
2021-09-21  7:21   ` Vincent Guittot
2021-09-21  7:53     ` Mel Gorman
2021-09-21  8:12       ` Vincent Guittot
2021-09-21  8:21       ` Peter Zijlstra
2021-09-21 10:03         ` Mel Gorman
2021-09-20 14:26 ` [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running Mel Gorman
2021-09-21  3:52   ` Mike Galbraith
2021-09-21  5:50     ` Mike Galbraith
2021-09-21  7:04     ` Mike Galbraith
2021-09-21 10:36     ` Mel Gorman
2021-09-21 12:32       ` Mike Galbraith
2021-09-21 14:03         ` Mel Gorman
2021-10-05  9:24         ` Peter Zijlstra
2021-09-22  5:22       ` Mike Galbraith
2021-09-22 13:20         ` Mel Gorman
2021-09-22 14:04           ` Mike Galbraith
2021-09-22 14:15           ` Vincent Guittot
2021-09-22 15:04             ` Mel Gorman
2021-09-22 16:00               ` Vincent Guittot
2021-09-22 17:38                 ` Mel Gorman
2021-09-22 18:22                   ` Vincent Guittot
2021-09-22 18:57                     ` Mel Gorman
2021-09-23  1:47                     ` Mike Galbraith
2021-09-23  8:40                       ` Vincent Guittot
2021-09-23  9:21                         ` Mike Galbraith
2021-09-23 12:41                           ` Vincent Guittot
2021-09-23 13:14                             ` Mike Galbraith
2021-09-27 11:17                             ` Mel Gorman
2021-09-27 14:17                               ` Mike Galbraith
2021-10-04  8:05                                 ` Mel Gorman
2021-10-04 16:37                                   ` Vincent Guittot
2021-10-05  7:41                                     ` Mel Gorman
2021-09-27 14:19                               ` Vincent Guittot
2021-09-27 15:02                                 ` Mel Gorman
2021-09-23 12:24                         ` Phil Auld
2021-10-05 10:36                           ` Peter Zijlstra
2021-10-05 14:12                             ` Phil Auld
2021-10-05 14:32                               ` Peter Zijlstra
2021-10-05 10:28                     ` Peter Zijlstra
2021-10-05 10:23                   ` Peter Zijlstra
2021-10-05  9:41               ` Peter Zijlstra
2021-09-22 15:05             ` Vincent Guittot [this message]
2021-10-05  9:32           ` Peter Zijlstra
2021-10-03  3:07         ` wakeup_affine_weight() is b0rked - was " Mike Galbraith
2021-10-03  7:34           ` Barry Song
2021-10-03 14:52             ` Mike Galbraith
2021-10-03 21:06               ` Barry Song
2021-10-04  1:49                 ` Mike Galbraith
2021-10-04  4:34             ` Mike Galbraith
2021-10-04  9:06               ` Mike Galbraith
2021-10-05  7:47                 ` Mel Gorman
2021-10-05  8:42                   ` Mike Galbraith
2021-10-05  9:31                     ` Mel Gorman
2021-10-06  6:46                       ` Mike Galbraith
2021-10-08  5:06                       ` Mike Galbraith
2021-09-21  8:03   ` Vincent Guittot
2021-09-21 10:45     ` Mel Gorman
