Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

From: Tejun Heo <tj@kernel.org>
To: Josh Don <joshdon@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	linux-kernel@vger.kernel.org,
	Joel Fernandes <joel@joelfernandes.org>
Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth
Date: Tue, 1 Nov 2022 11:49:55 -1000
Message-ID: <Y2GUg8CiI68ZBznr@slm.duckdns.org>
In-Reply-To: <CABk29Nua8ZsDfhY+x+VfYDkbkjfXLXTZ5JMVR9uiBygraxDM+g@mail.gmail.com>

Hello,

On Tue, Nov 01, 2022 at 01:56:29PM -0700, Josh Don wrote:
> Maybe walking through an example would be helpful? I don't know if
> there's anything super specific. For cgroup_mutex for example, the
> same global mutex is being taken for things like cgroup mkdir and
> cgroup proc attach, regardless of which part of the hierarchy is being
> modified. So, we end up sharing that mutex between random job threads
> (ie. that may be manipulating their own cgroup sub-hierarchy), and
> control plane threads, which are attempting to manage root-level
> cgroups. Bad things happen when the cgroup_mutex (or similar) is held
> by a random thread which blocks and is of low scheduling priority,
> since when it wakes back up it may take quite a while for it to run
> again (whether that low priority be due to CFS bandwidth, sched_idle,
> or even just O(hundreds) of threads on a cpu). Starving out the
> control plane causes us significant issues, since that affects machine
> health. cgroup manipulation is not a hot path operation, but the
> control plane tends to hit it fairly often, and so those things
> combine at our scale to produce this rare problem.

I keep asking because I'm curious about the specific details of the
contention. The control plane locking up is obviously bad, but it can usually
tolerate some latency - stalling out for multiple seconds (or longer) can be
catastrophic, but tens or hundreds of millisecs occasionally usually isn't.

The only times we've seen latency spikes from the CPU side that were enough to
cause system-level failures were when there were severe restrictions through
bw control. Other cases sure are possible, but unless you grab these mutexes
while IDLE inside a heavily contended cgroup (which is a bit silly), you
gotta push *really* hard.

If most of the problems were with cpu bw control, fixing that should do for
the time being. Otherwise, we'll have to think about finishing the kernfs
locking granularity improvements and doing something similar for cgroup
locking too.

Thanks.

-- 
tejun