Re: [GIT PULL] x86/mm changes for v5.8

From: "Singh, Balbir" <sblbir@amazon.com>
To: "torvalds@linux-foundation.org" <torvalds@linux-foundation.org>
Cc: "mingo@kernel.org" <mingo@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"a.p.zijlstra@chello.nl" <a.p.zijlstra@chello.nl>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"luto@kernel.org" <luto@kernel.org>,
	"bp@alien8.de" <bp@alien8.de>,
	"benh@kernel.crashing.org" <benh@kernel.crashing.org>
Subject: Re: [GIT PULL] x86/mm changes for v5.8
Date: Tue, 2 Jun 2020 10:25:17 +0000	[thread overview]
Message-ID: <7e5f2a555a0d1198d7c5713f1d4810c3086e1602.camel@amazon.com> (raw)
In-Reply-To: <CAHk-=wjVLSOJ+uyEOhze_AaJi2mpmOAn7QtKahO1DC-Oj5teuA@mail.gmail.com>

On Mon, 2020-06-01 at 19:35 -0700, Linus Torvalds wrote:
> 
> On Mon, Jun 1, 2020 at 6:06 PM Balbir Singh <sblbir@amazon.com> wrote:
> > 
> > I think apps can do this independently today as in do the flush
> > via software fallback in the app themselves.
> 
> Sure, but they can't force the kernel to do crazy things for every task switch.

On every mm switch, yes :) 

> 
> > But I see your concern with
> > respect to an overall admin override, I thought about it, but I stopped
> > because of the reason above.
> 
> So my real concern is that I had a hard time following all the logic,
> but it _looked_ to me like it ends up doing the flushes even if SMT is
> enabled, for example.
> 
> And that's not an "admin override". That's just a full no-no. If SMT
> is on, then we don't do any L1D$ flushing in the kernel, because it
> would ne entirely pointless.

Just to clarify SMT as in SMT on the current core, not the entire system.

> 
> But this all quite possibly interacts with all the subtle static
> branches we have in this same area, so ...
> 
> > The software fallback was there and is the same algorithm that we use
> > for L1TF. We could potentially remove the software fallback if there
> > is a concern, having the software fallback provides a mechanism
> > independent of the microcode version similar to L1TF.
> 
> So the SW fallback raises my hackles, because we had the exact same
> kind of non-architected software fallback for clearing the data
> buffers for MDS.
> 
> And that one turned out to be not only incredibly expensive, but it
> didn't work reliably anyway, and was really only written for one
> microarchitecture.
> 
> It's not clear that a (generic) SW routine _can_ reliably flush the
> L1D$ simply because of replacement policy issues and possibly D$
> cacheline management settings.
> 
> Again, I'm sure that for any particular microarchitecture, writing a
> SW fallback is possible.
> 
> And I suspect that for most of them, "fill enough of the cache" ends
> up making things very hard to trigger in practice, but at a very high
> cost.

Yes, I agree with that, we could turn off the fallback if that helps
and we can have this feature enabled for when we have the MSR to do the
L1D flush.

> 
> > > I have a hard time following whether this might all end up being
> > > predicated on the STIBP static branch conditionals and might thus at
> > > least be limited only to CPU's that have the problem in the first
> > > place.
> > 
> > No, this is at the moment restricted to just Intel CPUs and it is designed
> > as a generic fallback mechanism for issues involving speculation and L1D
> > flush for example CVE-2020-0550[1]. This mechanism addresses issues beyond
> > what STIBP addresses.
> 
> Ok, so that was hard to see.
> 
> What is it that disables it exactly? I'm ok with the concept, but it
> needs to be clearly targeted, and the target wasn't clear to me. In
> fact, everything in the docs and changlog implied that the target was
> "

It addresses issues associated with "Snoop-Assisted L1D Sampling". Is that
what you referred to as target?

> 
> > > Because I don't want a random "I can make the kernel do stupid things"
> > > flag for people to opt into. I think it needs a double opt-in.
> > > 
> > 
> > Are you happy for this controlled by CAP_SYS_ADMIN or are you suggesting
> > a sysfs override by the administrator to completely disable this?
> 
> At a _minimum_, I want to make sure this code never possibly ever
> starts flushing anything if SMT is on.
> 
> That may be the case already. The narrow twisty mazes here from
> "enable this" to "do the flush" were too hard to follow for me, which
> is why I'm asking for more clarification.
> 
> And yes, I did read the documentation you point to. That did not
> clarify anything at all.
> 
> So to put this in very clear terms:
> 
>  - what happens if SMT is on, and a process does
> "prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT, PR_SPEC_ENABLE,
> 0, 0);"
> 
> Because if the answer is "the kernel starts flushing the L1D at
> context switches" then I refuse to pull this.
> 
> Why? Because it's just an incredibly stupid waste of time and effort
> to do that, and I can see some poor hapless ssh developer saying "yes,
> I should enable this thing because ssh is very special", and then ssh
> just starts wasting time on something that doesn't actually help.
> 
> See what I'm saying? I want to know this feature isn't doing crazy
> things. And "flush L1D$ on SMT" is crazy, since an attacker would just
> sit on a sibling core and attack the L1 contents *before* the task
> switch happens.
> 
> I have some other questions about this approach in the first place. I
> don't see why context switch is even relevant, and why it should be
> the place we flush. The kernel is trustworthy in this situation, both
> before and after the context switch. So context switch makes no
> difference what-so-ever from a security domain transfer angle.
> 
> Also, shouldn't we avoid flushing if you just run as the same user all
> the time? IOW, context switch in itself isn't really relevant as a
> security domain transfer, but it *is* relevant in the sense that
> switching from one user to another is a sign of "uhhuh, now maybe I
> should be careful when returning to user mode".
>

The example you list above, will cause flushing if mm context switches
on the core. I've tested the implementation with such a program and seen
context switches only when another task (different mm) is on the same
CPU as the one with the program that needs to flush. Normally on an idle
system where nothing else with a different mm gets scheduled on the same
core as the program wanting to flush L1D, I see no flushes.

On the user front, I am not so sure. A user can host multiple tasks and
if one of them was compromised, it would be bad to let it allow the leak
to happen. For example if the plugin in a browser could leak a security
key of a secure session, that would be bad.

> IOW, think of a "pipe ping-pong" test program. Set the flag for "I
> want L1D$ cache flushing". Run the program with nothing else
> happening, and a _good_ implementation should never ever cache-flush,
> because at no point did we ever enter untrusted space: we were either
> in the kernel (not just for the system calls, but for idle threads),
> or we were in user context that had the right to see the data anyway.
> 
> So despite asking for a L1D$ flush on context changes, such a
> ping-pong test program that does millions of context switches per
> second shouldn't actually ever cause a cache flush, because there was
> never any point.
> 
> IOW, what mitigations are in place for this not doing unnecessary
> cache flushes, either because they are fundamentally pointless (the
> machine has SMT enabled) or because they just aren't necessary (no
> transition to an untrusted security domain has happened)?

We tried to bring them down quite a bit by applying the flush on
mm switches, we could bring it down by applying a SMT filter for the core
(core is running single threaded).

> 
> And maybe those mitigations are actually there, and I just couldn't
> figure it out. Some of the code scares me ("cond_ibpb()" and friends,
> even if you did rename it to "cond_mitigation()").
> 

I reused some of those bits from earlier code, largely because of the overlap
with the use case of flushing on mm context change.

It sounds like changing the flush to do the following:

Avoiding the L1D flush if the current core is in SMT mode would address
your concern?

For user space considerations prior to flushing, I provided some arguments
above.

Balbir Singh.