Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

From: Tejun Heo <tj@kernel.org>
To: Josh Don <joshdon@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	linux-kernel@vger.kernel.org,
	Joel Fernandes <joel@joelfernandes.org>
Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth
Date: Mon, 31 Oct 2022 11:50:12 -1000
Message-ID: <Y2BDFNpkSawKnE9S@slm.duckdns.org>
In-Reply-To: <CABk29Nu=XcjwRxnGBtKHfknxnDPpspghou06+W0fufnkGF6NkA@mail.gmail.com>

Hello,

On Mon, Oct 31, 2022 at 02:22:42PM -0700, Josh Don wrote:
> > So, TJ has been complaining about us throttling in kernel-space, causing
> > grief when we also happen to hold a mutex or some other resource and has
> > been prodding us to only throttle at the return-to-user boundary.
> 
> Yea, we've been having similar priority inversion issues. It isn't
> limited to CFS bandwidth though, such problems are also pretty easy to
> hit with configurations of shares, cpumasks, and SCHED_IDLE. I've

We need to distinguish between work-conserving and non-work-conserving
control schemes. Work-conserving ones - such as shares and idle - shouldn't
affect the aggregate amount of work the system can perform. There may be
local and temporary priority inversions but they shouldn't affect the
throughput of the system and the scheduler should be able to make the
eventual resource distribution conform to the configured targets.

CPU affinity and BW control are not work-conserving and thus cause a
different class of problems. While it is possible to slow down a system with
overly restrictive CPU affinities, it's a lot harder to do so severely
compared to BW control because no matter what you do, there's still at least
one CPU which can make full forward progress. With BW control, it's really
easy to stall the entire system almost completely because we're giving
userspace the ability to stall tasks for an arbitrary amount of time at
random places in the kernel. This is what the cgroup1 freezer did, and it
had exactly the same problems.
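
To put a rough number on that, here is a toy userspace model, not kernel
code. The function name and the tick-based quota/period model are made up
for illustration. It counts how long a lock holder takes to get out of a
critical section when its group may run only `quota` ticks out of every
`period` ticks. Anyone waiting on that lock, throttled or not, is stuck for
the whole duration.

```c
#include <assert.h>

/*
 * Hypothetical toy model: the lock holder needs `hold_ticks` ticks of CPU
 * time to finish its critical section, but its group is allowed to run only
 * during the first `quota` ticks of every `period` ticks.
 *
 * Returns the wall-clock tick at which the lock is finally released.
 */
static int toy_ticks_until_unlock(int hold_ticks, int quota, int period)
{
	int now = 0;
	int left = hold_ticks;

	assert(quota > 0 && period >= quota && hold_ticks >= 0);

	while (left > 0) {
		if (now % period < quota)
			left--;		/* holder has quota and makes progress */
		/* otherwise the holder is throttled with the lock held */
		now++;
	}
	return now;
}
```

With quota equal to period, ten ticks of work take ten ticks. With one tick
of quota per hundred, the same ten ticks of work hold the lock for roughly
nine hundred ticks, and the ratio is entirely up to userspace.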

> chatted with the folks working on the proxy execution patch series,
> and it seems like that could be a better generic solution to these
> types of issues.

Care to elaborate?

> Throttle at return-to-user seems only mildly beneficial, and then only
> really with preemptive kernels. Still pretty easy to get inversion
> issues, e.g. a thread holding a kernel mutex wake back up into a
> hierarchy that is currently throttled, or a thread holding a kernel
> mutex exists in the hierarchy being throttled but is currently waiting
> to run.

I don't follow. If you only throttle at predefined safe spots, the easiest
place being the kernel-user boundary, you cannot get system-wide stalls from
BW restrictions, which is something the kernel shouldn't allow userspace to
cause. In your example, a thread holding a kernel mutex waking back up into
a hierarchy that is currently throttled should keep running in the kernel
until it reaches such a safe throttling point, by which time it would have
released the kernel mutex, and only then throttle.
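
As a minimal sketch of what I mean, again a toy userspace model rather than
actual scheduler code (the struct, its fields and all function names are
hypothetical): running out of quota while in the kernel only marks the task,
and the throttle takes effect at the return-to-user boundary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task state for the toy model. */
struct toy_task {
	bool in_kernel;		/* currently executing kernel code */
	bool throttle_pending;	/* quota ran out while in the kernel */
	bool throttled;		/* actually taken off the CPU */
};

/* Called when the group's quota is exhausted. */
static void toy_quota_expired(struct toy_task *t)
{
	if (t->in_kernel)
		t->throttle_pending = true;	/* defer to a safe point */
	else
		t->throttled = true;		/* userspace: throttle now */
}

/* The safe point: no kernel locks can be held here. */
static void toy_return_to_user(struct toy_task *t)
{
	t->in_kernel = false;
	if (t->throttle_pending) {
		t->throttle_pending = false;
		t->throttled = true;
	}
}

/*
 * Run a critical section needing `hold_ticks` ticks with only `quota` ticks
 * of budget. Returns the tick at which the lock is released.
 */
static int toy_run_critical_section(struct toy_task *t, int hold_ticks,
				    int quota)
{
	int now = 0;

	t->in_kernel = true;		/* enter kernel, take the mutex */
	while (hold_ticks > 0) {
		if (now >= quota)
			toy_quota_expired(t);	/* marks only, keeps running */
		assert(!t->throttled);	/* never throttled with lock held */
		hold_ticks--;
		now++;
	}
	/* mutex released here, before the safe point */
	toy_return_to_user(t);
	return now;
}
```

In this model the lock hold time depends only on the work inside the
critical section, not on the quota. The task overruns its budget a bit and
pays for it once it is back at the boundary.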

Thanks.

-- 
tejun