From: Qais Yousef <qais.yousef@arm.com>
To: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
Mel Gorman <mgorman@suse.de>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Randy Dunlap <rdunlap@infradead.org>,
Jonathan Corbet <corbet@lwn.net>,
Juri Lelli <juri.lelli@redhat.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>,
Luis Chamberlain <mcgrof@kernel.org>,
Kees Cook <keescook@chromium.org>,
Iurii Zaikin <yzaikin@google.com>,
Quentin Perret <qperret@google.com>,
Valentin Schneider <valentin.schneider@arm.com>,
Pavan Kondeti <pkondeti@codeaurora.org>,
linux-doc@vger.kernel.org,
linux-kernel <linux-kernel@vger.kernel.org>,
linux-fs <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
Date: Fri, 5 Jun 2020 12:32:04 +0100 [thread overview]
Message-ID: <20200605113204.srskjrunz2ezkcuj@e107158-lin.cambridge.arm.com> (raw)
In-Reply-To: <875zc60ww2.derkling@matbug.net>
On 06/05/20 09:55, Patrick Bellasi wrote:
>
> Hi Qais,
>
> On Wed, Jun 03, 2020 at 18:52:00 +0200, Qais Yousef <qais.yousef@arm.com> wrote...
>
> > On 06/03/20 16:59, Vincent Guittot wrote:
> >> When I want to stress the fast path i usually use "perf bench sched pipe -T "
> >> The tip/sched/core on my arm octo core gives the following results for
> >> 20 iterations of perf bench sched pipe -T -l 50000
> >>
> >> all uclamp config disabled 50035.4(+/- 0.334%)
> >> all uclamp config enabled 48749.8(+/- 0.339%) -2.64%
>
> I use to run the same test but I don't remember such big numbers :/
Yeah I remember you ran a lot of testing on this.
>
> >> It's quite easy to reproduce and probably easier to study the impact
> >
> > Thanks Vincent. This is very useful!
> >
> > I could reproduce that on my Juno.
> >
> > One of the codepath I was suspecting seems to affect it.
> >
> >
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0464569f26a7..9f48090eb926 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1063,10 +1063,12 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
> > * e.g. due to future modification, warn and fixup the expected value.
> > */
> > SCHED_WARN_ON(bucket->value > rq_clamp);
> > +#if 0
> > if (bucket->value >= rq_clamp) {
> > bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
> > WRITE_ONCE(uc_rq->value, bkt_clamp);
> > }
> > +#endif
>
> Yep, that's likely where we have most of the overhead at dequeue time,
> sine _sometimes_ we need to update the cpu's clamp value.
>
> However, while running perf sched pipe, I expect:
> - all tasks to have the same clamp value
> - all CPUs to have _always_ at least one RUNNABLE task
>
> Given these two conditions above, if the CPU is never "CFS idle" (i.e.
> without RUNNABLE CFS tasks), the code above should never be triggered.
> More on that later...
So the cost is only incurred by idle cpus is what you're saying.
>
> > }
> >
> > static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >
> >
> >
> > uclamp_rq_max_value() could be expensive as it loops over all buckets.
>
> It loops over UCLAMP_CNT values which are defined to fit into a single
I think you meant to say UCLAMP_BUCKETS which is defined 5 by default.
> $L. That was the optimal space/time complexity compromise we found to
> get the MAX of a set of values.
It actually covers two cachelines, see below and my other email to Mel.
>
> > Commenting this whole path out strangely doesn't just 'fix' it,
> > but produces better results to no-uclamp kernel :-/
> >
> > # ./perf bench -r 20 sched pipe -T -l 50000
> > Without uclamp: 5039
> > With uclamp: 4832
> > With uclamp+patch: 5729
>
> I explain it below: with that code removed you never decrease the CPU's
> uclamp value. Thus, the first time you schedule an RT task you go to MAX
> OPP and stay there forever.
Okay.
>
> > It might be because schedutil gets biased differently by uclamp..? If I move to
> > performance governor these numbers almost double.
> >
> > I don't know. But this promoted me to look closer and
>
> Just to resume, when a task is dequeued we can have only these cases:
>
> - uclamp(task) < uclamp(cpu):
> this happens when the task was co-scheduled with other tasks with
> higher clamp values which are still RUNNABLE.
> In this case there are no uclamp(cpu) updates.
>
> - uclamp(task) == uclamp(cpu):
> this happens when the task was one of the tasks defining the current
> uclamp(cpu) value, which is defined to track the MAX of the RUNNABLE
> tasks clamp values.
>
> In this last case we _not_ always need to do a uclamp(cpu) update.
> Indeed the update is required _only_ when that task was _the last_ task
> defining the current uclamp(cpu) value.
>
> In this case we use uclamp_rq_max_value() to do a linear scan of
> UCLAMP_CNT values which fits into a single cache line.
Again, I think you mean UCLAMP_BUCKETS here. Unless I missed something, they
span 2 cahcelines on 64bit machines and 64b cacheline size.
To be specific, I am referring to struct uclamp_rq, which defines an array of
size UCLAMP_BUCKETS of type struct uclamp_bucket.
uclamp_rq_max_value() scans the buckets for a given clamp_id (UCLAMP_MIN or
UCLAMP_MAX).
So sizeof(struct uclamp_rq) = 8 * 5 + 4 = 44; on 64bit machines.
And actually the compiler introduces a 4 bytes hole, so we end up with a total
of 48 bytes.
In struct rq, we define struct uclamp_rq as an array of UCLAMP_CNT which is 2.
So by default we have 2 * sizeof(struct uclamp_rq) = 96 bytes.
>
> > I think I spotted a bug where in the if condition we check for '>='
> > instead of '>', causing us to take the supposedly impossible fail safe
> > path.
>
> The fail safe path is when the '>' condition matches, which is what the
> SCHED_WARN_ON tell us. Indeed, we never expect uclamp(cpu) to be bigger
> than one of its RUNNABLE tasks. If that should happen we WARN and fix
> the cpu clamp value for the best.
>
> The normal path is instead '=' and, according to by previous resume,
> it's expected to be executed _only_ when we dequeue the last task of the
> clamp group defining the current uclamp(cpu) value.
Okay. I was mislead by the comment then. Thanks for clarifying.
Can this function be broken down to deal with '=' separately from the '>' case?
IIUC, for the common '=', we always want to return uclamp_idle_value() hence
skip the potentially expensive scan?
Anyway, based on Vincent results, it doesn't seem this path is an issue for him
and the real problem is lurking somewhere else.
>
> > Mind trying with the below patch please?
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0464569f26a7..50d66d4016ff 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1063,7 +1063,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
> > * e.g. due to future modification, warn and fixup the expected value.
> > */
> > SCHED_WARN_ON(bucket->value > rq_clamp);
> > - if (bucket->value >= rq_clamp) {
> > + if (bucket->value > rq_clamp) {
> > bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
> > WRITE_ONCE(uc_rq->value, bkt_clamp);
> > }
>
> This patch is thus bogus, since we never expect to have uclamp(cpu)
> bigger than uclamp(task) and thus we will never reset the clamp value of
> a cpu.
Yeah I got confused by SCHED_WARN_ON() and the comment. Thanks for clarifying.
Cheers
--
Qais Yousef
next prev parent reply other threads:[~2020-06-05 11:32 UTC|newest]
Thread overview: 68+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-05-11 15:40 [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value Qais Yousef
2020-05-11 15:40 ` [PATCH 2/2] Documentation/sysctl: Document uclamp sysctl knobs Qais Yousef
2020-05-11 17:18 ` [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value Qais Yousef
2020-05-12 2:10 ` Pavan Kondeti
2020-05-12 11:46 ` Qais Yousef
2020-05-15 11:08 ` Patrick Bellasi
2020-05-18 8:31 ` Dietmar Eggemann
2020-05-18 16:49 ` Qais Yousef
2020-05-28 13:23 ` Peter Zijlstra
2020-05-28 15:58 ` Qais Yousef
2020-05-28 16:11 ` Peter Zijlstra
2020-05-28 16:51 ` Qais Yousef
2020-05-28 18:29 ` Peter Zijlstra
2020-05-28 19:08 ` Patrick Bellasi
2020-05-28 19:20 ` Dietmar Eggemann
2020-05-29 9:11 ` Qais Yousef
2020-05-29 10:21 ` Mel Gorman
2020-05-29 15:11 ` Qais Yousef
2020-05-29 16:02 ` Mel Gorman
2020-05-29 16:05 ` Qais Yousef
2020-05-29 10:08 ` Mel Gorman
2020-05-29 16:04 ` Qais Yousef
2020-05-29 16:57 ` Mel Gorman
2020-06-02 16:46 ` Dietmar Eggemann
2020-06-03 8:29 ` Patrick Bellasi
2020-06-03 10:10 ` Mel Gorman
2020-06-03 14:59 ` Vincent Guittot
2020-06-03 16:52 ` Qais Yousef
2020-06-04 12:14 ` Vincent Guittot
2020-06-05 10:45 ` Qais Yousef
2020-06-09 15:29 ` Vincent Guittot
2020-06-08 12:31 ` Qais Yousef
2020-06-08 13:06 ` Valentin Schneider
2020-06-08 14:44 ` Steven Rostedt
2020-06-11 10:13 ` Qais Yousef
2020-06-09 17:10 ` Vincent Guittot
2020-06-11 10:24 ` Qais Yousef
2020-06-11 12:01 ` Vincent Guittot
2020-06-23 15:44 ` Qais Yousef
2020-06-24 8:45 ` Vincent Guittot
2020-06-05 7:55 ` Patrick Bellasi
2020-06-05 11:32 ` Qais Yousef [this message]
2020-06-05 13:27 ` Patrick Bellasi
2020-06-03 9:40 ` Mel Gorman
2020-06-03 12:41 ` Qais Yousef
2020-06-04 13:40 ` Mel Gorman
2020-06-05 10:58 ` Qais Yousef
2020-06-11 10:58 ` Qais Yousef
2020-06-16 11:08 ` Qais Yousef
2020-06-16 13:56 ` Lukasz Luba
-- strict thread matches above, loose matches on Subject: below --
2020-04-03 12:30 Qais Yousef
2020-04-14 18:21 ` Patrick Bellasi
2020-04-15 7:46 ` Patrick Bellasi
2020-04-20 15:04 ` Qais Yousef
2020-04-20 8:24 ` Dietmar Eggemann
2020-04-20 15:19 ` Qais Yousef
2020-04-21 0:52 ` Steven Rostedt
2020-04-21 11:16 ` Dietmar Eggemann
2020-04-21 11:23 ` Qais Yousef
2020-04-20 14:50 ` Qais Yousef
2020-04-15 10:11 ` Quentin Perret
2020-04-20 15:08 ` Qais Yousef
2020-04-20 8:29 ` Dietmar Eggemann
2020-04-20 15:13 ` Qais Yousef
2020-04-21 11:18 ` Dietmar Eggemann
2020-04-21 11:27 ` Qais Yousef
2020-04-22 10:59 ` Dietmar Eggemann
2020-04-22 13:13 ` Qais Yousef
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200605113204.srskjrunz2ezkcuj@e107158-lin.cambridge.arm.com \
--to=qais.yousef@arm.com \
--cc=bsegall@google.com \
--cc=corbet@lwn.net \
--cc=dietmar.eggemann@arm.com \
--cc=juri.lelli@redhat.com \
--cc=keescook@chromium.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mcgrof@kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=patrick.bellasi@matbug.net \
--cc=peterz@infradead.org \
--cc=pkondeti@codeaurora.org \
--cc=qperret@google.com \
--cc=rdunlap@infradead.org \
--cc=rostedt@goodmis.org \
--cc=valentin.schneider@arm.com \
--cc=vincent.guittot@linaro.org \
--cc=yzaikin@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).