Re: [GIT pull] sched/core for v5.16-rc1

From: Mark Rutland <mark.rutland@arm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Kees Cook <keescook@chromium.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	the arch/x86 maintainers <x86@kernel.org>
Subject: Re: [GIT pull] sched/core for v5.16-rc1
Date: Wed, 3 Nov 2021 13:52:49 +0000	[thread overview]
Message-ID: <20211103135249.GA38767@C02TD0UTHF1T.local> (raw)
In-Reply-To: <YYD5ti23DQUjdQdz@hirez.programming.kicks-ass.net>

On Tue, Nov 02, 2021 at 09:41:26AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 01, 2021 at 02:27:49PM -0700, Linus Torvalds wrote:
> > On Mon, Nov 1, 2021 at 2:01 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > Unwinders that need locks because they can do bad things if they are
> > > working on unstable data are EVIL and WRONG.
> > 
> > Note that this is fundamental: if you can fool an unwider to do
> > something bad just because the data isn't stable, then the unwinder is
> > truly horrendously buggy, and not usable.
> 
> From what I've been led to believe, quite a few of our arch unwinders
> seem to fall in that category. They're mostly only happy when unwinding
> self and don't have many guardrails on otherwise.
> 
> > It could be a user process doing bad things to the user stack frame
> > from another thread when profiling is enabled.
> 
> Most of the unwinders seem to only care about the kernel stack. Not the
> user stack.

Yup; there are usually separate unwinders for user/kernel, since there
are different constaints (and potentially different ABIs for unwinding).

> > It could be debug code unwinding without locks for random reasons.
> > 
> > So I really don't like "take a lock for unwinding". It's a pretty bad
> > bug if the lock required.
> 
> Fair enough; te x86 unwinder is pretty robust in this regard, but it
> seems to be one of few :/

FWIW, the arm64 kernel unwinder also shouldn't blow up (so long as the
target stack is pinned via try_get_stack() or similar).

However, depending on how the task reuses the stack, the results can be
entirely bogus rather than just stale, since data on the stack can look
like a kernel pointer (even if that's fairly unllikely). I'm happy to
believe that we don't care aobut that for wchan, but it's not something
I'd like to see spread.

> > The "Link" in the commit also is entirely useless, pointing back to
> > the emailed submission of the patch, rather than any useful discussion
> > about why the patch happened.
> 
> So the initial discussion started here:
> 
>   https://lkml.kernel.org/r/20210923233105.4045080-1-keescook@chromium.org
> 
> A later thread that might also be of interest is:
> 
>   https://lkml.kernel.org/r/YWgyy+KvNLQ7eMIV@shell.armlinux.org.uk
> 
> Also, an even later thread proposes to push that lock into more stack
> unwinding functions (anything doing remote unwinds):
> 
>   https://lkml.kernel.org/r/20211022150933.883959987@infradead.org
> 
> But it seems to be you're thinking that's fundamentally buggered and
> people should instead invest in fixing their unwinders already.
> 
> Now, as is, this stuff is user exposed through /proc/$pid/{wchan,stack}
> and as such I think it *can* do with a few extra guardrails in generic
> code. OTOH, /proc/$pid/stack is root only.
> 
> Also, the remote stack-trace code is hooked into bpf (because
> kitchen-sink) and while I didn't look too hard, I can imagine it could
> be used to trigger crashes on our less robust architectures if prodded
> just right.

I do worry that remote unwinds from BPF are just silently generating
junk, but it's not clear to me what they're actually used for and how
much that matters. I don't understand why a remote unwind is necessary
at all.

> Should I care about all this from a generic code PoV, or simply let the
> architectures that got it 'wrong' deal with it?

FWIW I'm happy either way. There are some upcoming improvements to the
arm64 unwinder that currently conflict and I need to know whether to
wait and rebase or assume that we take those first.

Thanks,
Mark.