Re: [GIT PULL] x86/mm changes for v5.8

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Balbir Singh <sblbir@amazon.com>
Cc: Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Andrew Morton <akpm@linux-foundation.org>,
	Borislav Petkov <bp@alien8.de>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrew Lutomirski <luto@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subject: Re: [GIT PULL] x86/mm changes for v5.8
Date: Mon, 1 Jun 2020 19:35:31 -0700	[thread overview]
Message-ID: <CAHk-=wjVLSOJ+uyEOhze_AaJi2mpmOAn7QtKahO1DC-Oj5teuA@mail.gmail.com> (raw)
In-Reply-To: <CAHk-=wgXf_wQ9zrJKv2Hy4EpEbLuqty-Cjbs2u00gm7XcYHBfw@mail.gmail.com>

On Mon, Jun 1, 2020 at 6:06 PM Balbir Singh <sblbir@amazon.com> wrote:
>
> I think apps can do this independently today as in do the flush
> via software fallback in the app themselves.

Sure, but they can't force the kernel to do crazy things for every task switch.

> But I see your concern with
> respect to an overall admin override, I thought about it, but I stopped
> because of the reason above.

So my real concern is that I had a hard time following all the logic,
but it _looked_ to me like it ends up doing the flushes even if SMT is
enabled, for example.

And that's not an "admin override". That's just a full no-no. If SMT
is on, then we don't do any L1D$ flushing in the kernel, because it
would ne entirely pointless.

But this all quite possibly interacts with all the subtle static
branches we have in this same area, so ...

> The software fallback was there and is the same algorithm that we use
> for L1TF. We could potentially remove the software fallback if there
> is a concern, having the software fallback provides a mechanism
> independent of the microcode version similar to L1TF.

So the SW fallback raises my hackles, because we had the exact same
kind of non-architected software fallback for clearing the data
buffers for MDS.

And that one turned out to be not only incredibly expensive, but it
didn't work reliably anyway, and was really only written for one
microarchitecture.

It's not clear that a (generic) SW routine _can_ reliably flush the
L1D$ simply because of replacement policy issues and possibly D$
cacheline management settings.

Again, I'm sure that for any particular microarchitecture, writing a
SW fallback is possible.

And I suspect that for most of them, "fill enough of the cache" ends
up making things very hard to trigger in practice, but at a very high
cost.

> >I have a hard time following whether this might all end up being
> >predicated on the STIBP static branch conditionals and might thus at
> >least be limited only to CPU's that have the problem in the first
> >place.
>
> No, this is at the moment restricted to just Intel CPUs and it is designed
> as a generic fallback mechanism for issues involving speculation and L1D
> flush for example CVE-2020-0550[1]. This mechanism addresses issues beyond
> what STIBP addresses.

Ok, so that was hard to see.

What is it that disables it exactly? I'm ok with the concept, but it
needs to be clearly targeted, and the target wasn't clear to me. In
fact, everything in the docs and changlog implied that the target was
"

> >Because I don't want a random "I can make the kernel do stupid things"
> >flag for people to opt into. I think it needs a double opt-in.
> >
>
> Are you happy for this controlled by CAP_SYS_ADMIN or are you suggesting
> a sysfs override by the administrator to completely disable this?

At a _minimum_, I want to make sure this code never possibly ever
starts flushing anything if SMT is on.

That may be the case already. The narrow twisty mazes here from
"enable this" to "do the flush" were too hard to follow for me, which
is why I'm asking for more clarification.

And yes, I did read the documentation you point to. That did not
clarify anything at all.

So to put this in very clear terms:

 - what happens if SMT is on, and a process does
"prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT, PR_SPEC_ENABLE,
0, 0);"

Because if the answer is "the kernel starts flushing the L1D at
context switches" then I refuse to pull this.

Why? Because it's just an incredibly stupid waste of time and effort
to do that, and I can see some poor hapless ssh developer saying "yes,
I should enable this thing because ssh is very special", and then ssh
just starts wasting time on something that doesn't actually help.

See what I'm saying? I want to know this feature isn't doing crazy
things. And "flush L1D$ on SMT" is crazy, since an attacker would just
sit on a sibling core and attack the L1 contents *before* the task
switch happens.

I have some other questions about this approach in the first place. I
don't see why context switch is even relevant, and why it should be
the place we flush. The kernel is trustworthy in this situation, both
before and after the context switch. So context switch makes no
difference what-so-ever from a security domain transfer angle.

Also, shouldn't we avoid flushing if you just run as the same user all
the time? IOW, context switch in itself isn't really relevant as a
security domain transfer, but it *is* relevant in the sense that
switching from one user to another is a sign of "uhhuh, now maybe I
should be careful when returning to user mode".

IOW, think of a "pipe ping-pong" test program. Set the flag for "I
want L1D$ cache flushing". Run the program with nothing else
happening, and a _good_ implementation should never ever cache-flush,
because at no point did we ever enter untrusted space: we were either
in the kernel (not just for the system calls, but for idle threads),
or we were in user context that had the right to see the data anyway.

So despite asking for a L1D$ flush on context changes, such a
ping-pong test program that does millions of context switches per
second shouldn't actually ever cause a cache flush, because there was
never any point.

IOW, what mitigations are in place for this not doing unnecessary
cache flushes, either because they are fundamentally pointless (the
machine has SMT enabled) or because they just aren't necessary (no
transition to an untrusted security domain has happened)?

And maybe those mitigations are actually there, and I just couldn't
figure it out. Some of the code scares me ("cond_ibpb()" and friends,
even if you did rename it to "cond_mitigation()").

                   Linus