All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@suse.de>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Don <joshdon@google.com>, Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Luis Chamberlain <mcgrof@kernel.org>,
	Kees Cook <keescook@chromium.org>,
	Iurii Zaikin <yzaikin@google.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	David Rientjes <rientjes@google.com>,
	Oleg Rombakh <olegrom@google.com>,
	linux-doc@vger.kernel.org, Paul Turner <pjt@google.com>
Subject: Re: [PATCH v2] sched: Warn on long periods of pending need_resched
Date: Wed, 24 Mar 2021 13:39:16 +0000	[thread overview]
Message-ID: <20210324133916.GQ15768@suse.de> (raw)
In-Reply-To: <YFssoD5NDl6dFfg/@hirez.programming.kicks-ass.net>

On Wed, Mar 24, 2021 at 01:12:16PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 24, 2021 at 11:42:24AM +0000, Mel Gorman wrote:
> > On Wed, Mar 24, 2021 at 11:54:24AM +0100, Peter Zijlstra wrote:
> > > On Wed, Mar 24, 2021 at 10:37:43AM +0100, Peter Zijlstra wrote:
> > > > Should we perhaps take out all SCHED_DEBUG sysctls and move them to
> > > > /debug/sched/ ? (along with the existing /debug/sched_{debug,features,preemp}
> > > > files)
> > > > 
> > > > Having all that in sysctl and documented gives them far too much sheen
> > > > of ABI.
> > > 
> > > ... a little something like this ...
> > > 
> > 
> > I did not read this particularly carefully or boot it to check but some
> > of the sysctls moved are expected to exist and should never should have
> > been under SCHED_DEBUG.
> > 
> > For example, I'm surprised that numa_balancing is under the SCHED_DEBUG
> > sysctl because there are legimiate reasons to disable that at runtime.
> > For example, HPC clusters running various workloads may disable NUMA
> > balancing globally for particular jobs without wanting to reboot and
> > reenable it when finished.
> 
> Yeah, lets say I was pleasantly surprised to find it there :-)
> 

Minimally, lets move that out before it gets kicked out. Patch below.

> > Moving something like sched_min_granularity_ns will break a number of
> > tuning guides as well as the "tuned" tool which ships by default with
> > some distros and I believe some of the default profiles used for tuned
> > tweak kernel.sched_min_granularity_ns
> 
> Yeah, can't say I care. I suppose some people with PREEMPT=n kernels
> increase that to make their server workloads 'go fast'. But I'll
> absolutely suck rock on anything desktop.
> 

Broadly speaking yes and despite the lack of documentation, enough people
think of that parameter when tuning for throughput vs latency depending on
the expected use of the machine.  kernel.sched_wakeup_granularity_ns might
get tuned if preemption is causing overscheduling. Same potentially with
kernel.sched_min_granularity_ns and kernel.sched_latency_ns. That said, I'm
struggling to think of an instance where I've seen tuning recommendations
properly quantified other than the impact on microbenchmarks but I
think there will be complaining if they disappear. I suspect that some
recommended tuning is based on "I tried a number of different values and
this seemed to work reasonably well".

kernel.sched_schedstats probably should not depend in SCHED_DEBUG because
it has value for workload analysis which is not necessarily about debugging
per-se. It might simply be informing whether another variable should be
tuned or useful for debugging applications rather than the kernel.

The others I'm less concerned with. kernel.sched_tunable_scaling is very
specific. sysctl_sched_migration_cost is subtle because it affects lots
of things including whether tasks are cache hot and load balancing and
is best left alone. I wonder how many people can accurately predict how
workloads will behave when that is tuned? sched_nr_migrate is also a hard
one to tune in a sensible fashion.

As an aside, I wonder how often SCHED_DEBUG has been enabled simply
because LATENCYTOP selects it -- no idea offhand why LATENCYTOP even
needs SCHED_DEBUG.

> These knobs really shouldn't have been as widely available as they are.
> 

Probably not. Worse, some of the tuning is probably based on "this worked
for workload X 10 years ago so I'll just keep doing that"

> And guides, well, the writes have to earn a living too, right.
> 

For most of the guides I've seen they either specify values without
explaining why or just describe roughly what the parameter does and it's
not always that accurate a description.

> > Whether there are legimiate reasons to modify those values or not,
> > removing them may generate fun bug reports.
> 
> Which I'll close with -EDONTCARE, userspace has to cope with
> SCHED_DEBUG=n in any case.

True but removing the throughput vs latency parameters is likely to
generate a lot of noise even if the reasons for tuning are bad ones.
Some definitely should not be depending on SCHED_DEBUG, others may
need to be moved to debugfs one patch at a time so they can be reverted
individually if complaining is excessive and there is a legiminate reason
why it should be tuned. It's possible that complaining will be based on
a workload regression that really depended on tuned changing parameters.

Anyway, I definitely want to save kernel.numa_balancing from the firing
line so....

--8<--
sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG

From: Mel Gorman <mgorman@suse.de>

The ability to enable/disable NUMA balancing is not a debugging feature
and should not depend on CONFIG_SCHED_DEBUG.  For example, machines within
a HPC cluster may disable NUMA balancing temporarily for some jobs and
re-enable it for other jobs without needing to reboot.

This patch removes the dependency on CONFIG_SCHED_DEBUG for
kernel.numa_balancing sysctl. The other numa balancing related sysctls
are left as-is because if they need to be tuned then it is more likely
that NUMA balancing needs to be fixed instead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sysctl.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62fbd09b5dc1..8042098ae080 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1753,6 +1753,9 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_SCHED_DEBUG */
+#ifdef CONFIG_NUMA_BALANCING
 	{
 		.procname	= "numa_balancing",
 		.data		= NULL, /* filled in by handler */
@@ -1763,7 +1766,6 @@ static struct ctl_table kern_table[] = {
 		.extra2		= SYSCTL_ONE,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
-#endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
 		.data		= &sysctl_sched_rt_period,

  reply	other threads:[~2021-03-24 13:40 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-23  3:57 [PATCH v2] sched: Warn on long periods of pending need_resched Josh Don
2021-03-24  9:37 ` Peter Zijlstra
2021-03-24 10:54   ` Peter Zijlstra
2021-03-24 10:55     ` Peter Zijlstra
2021-03-24 11:42     ` Mel Gorman
2021-03-24 12:12       ` Peter Zijlstra
2021-03-24 13:39         ` Mel Gorman [this message]
2021-03-24 14:36           ` Peter Zijlstra
2021-03-24 15:52             ` Mel Gorman
2021-03-25 21:58               ` Josh Don
2021-03-26  8:58                 ` Peter Zijlstra
2021-04-16 15:53           ` [tip: sched/core] sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG tip-bot2 for Mel Gorman
2021-03-24 11:27 ` [PATCH v2] sched: Warn on long periods of pending need_resched Mel Gorman
2021-03-25 21:50   ` Josh Don
2021-03-30 22:44     ` Josh Don
2021-04-16 15:04       ` Peter Zijlstra
2021-04-16 21:33         ` Josh Don
2021-04-19  7:52           ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210324133916.GQ15768@suse.de \
    --to=mgorman@suse.de \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=joshdon@google.com \
    --cc=juri.lelli@redhat.com \
    --cc=keescook@chromium.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mcgrof@kernel.org \
    --cc=mingo@redhat.com \
    --cc=olegrom@google.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=rientjes@google.com \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=yzaikin@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.