Re: [GIT PULL] x86/mm changes for v5.8

From: "Singh, Balbir" <sblbir@amazon.com>
To: "torvalds@linux-foundation.org" <torvalds@linux-foundation.org>,
	"tglx@linutronix.de" <tglx@linutronix.de>
Cc: "mingo@kernel.org" <mingo@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"keescook@chromium.org" <keescook@chromium.org>,
	"a.p.zijlstra@chello.nl" <a.p.zijlstra@chello.nl>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"luto@kernel.org" <luto@kernel.org>,
	"bp@alien8.de" <bp@alien8.de>,
	"benh@kernel.crashing.org" <benh@kernel.crashing.org>
Subject: Re:  [GIT PULL] x86/mm changes for v5.8
Date: Tue, 2 Jun 2020 23:01:17 +0000	[thread overview]
Message-ID: <105f5a87b689eab38baf4d51d03e9f9707e74c66.camel@amazon.com> (raw)
In-Reply-To: <CAHk-=wgOFnMW-EgymmrTyqTPLrpGJrUJ_wBzehMpyT=SO4-JRQ@mail.gmail.com>

On Tue, 2020-06-02 at 12:14 -0700, Linus Torvalds wrote:
> 
> On Tue, Jun 2, 2020 at 11:29 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > It's trivial enough to fix. We have a static key already which is
> > telling us whether SMT scheduling is active.
> 
> .. but should we do it here, in switch_mm() in the first place?
> 
> Should it perhaps just return an error if user land tries to set the
> "flush L1$" thing on an SMT system? And no, I don't think we care at
> all if people then start playing games and enabling/disabling SMT
> dynamically while applications are working. At that point the admin
> kets to keep both of the broken pieces.
> 
> Also, see my other point about how "switch_mm()" really isn't even a
> protection domain switch to begin with. We're still in the exact same
> protection domain we used to be in, with the exact same direct access
> to L1D$.
> 
> Why would we flush the caches on a random and irrelevant event in
> kernel space? switch_mm() simply isn't relevant for caches (well,
> unless you have fully virtual ones, but that's a completely different
> issue).
> 
> Wouldn't it be more sensible to treat it more like TIF_NEED_FPU_LOAD -
> have a per-cpu "I need to flush the cache" variable, and then the only
> thing a context switch does is to see if the user changed (or
> whatever) and then set the bit, and set TIF_NOTIFY_RESUME in the
> thread.
> 
> Because the L1D$ flush isn't a kernel issue, it's a "don't let user
> space try to attack it" issue. The kernel can already read it if it
> wants to.
> 
> And that's just the "big picture" issues I see. In the big picture,
> doing this when SMT is enabled is unbelievably stupid. And in the big
> picture, context switch really isn't a security domain change wrt the
> L1D$.
>

+ Kees (sorry my bad, I should have added him earlier)

I am going to take a step back and point to

https://software.intel.com/security-software-guidance/software-guidance/snoop-assisted-l1-data-sampling
https://software.intel.com/security-software-guidance/insights/deep-dive-snoop-assisted-l1-data-sampling

The suggested mitigation is

"Snoop-assisted L1D sampling could be mitigated by flushing the L1D cache
before executing potentially malicious applications"

We discussed the mitigations in an RFC

https://lore.kernel.org/lkml/20200313220415.856-1-sblbir@amazon.com/

switch_mm() was chosen as a trade-off between, how long do we keep the data
in the cache vs how frequently do we flush. What your suggesting is that we
use switch_mm() + return to user mode to decide when the security domain
changes?

> The more I look at those patches, the more I go "that's just wrong" on
> some of the "small picture" implementation details.
> 
> Here's just a few cases that I reacted to
> 
> Actual low-level flushing code:
> 
>  (1) the SW fallback
> 
>      (a) is only defined on Intel, and initializing the silly cache
> flush pages on any other vendor will fail.
> 
>      (b) seems to assume that 16 pages (order-4) is sufficient and
> necessary. Probably "good enough", but it's also an example of "yeah,
> that's expensive".
> 

The software flush is largely code reuse assuming the L1TF bits
were right

>      (c) and if I read the code correctly, trying to flush the L1D$ on
> non-intel without the HW support, it causes a WARN_ON_ONCE()! WTF?

That is not correct, the function only complains if we do a software fallback
flush without allocating the flush pages. That function is not exposed without
the user using the prctl() API, which allocates those flush pages. The other
user of this API is the L1TF flush logic

> 
>  (2) the HW case is done for any vendor, if it reports the "I have the MSR"

No l1d_flush_init_once() fails for users opting in via the prctl(), it
succeeds for users of L1TF.

> 
>  (3) the VMX support certainly has various sanity checks like "oh, CPU
> doesn't have X86_BUG_L1TF, then I won't do this even if there was some
> kernel command line to say I should". But the new prctrl doesn't have
> anything like that. It just enables that L1D$ thing mindlessly,
> thinking that user-land software somehow knows what it's doing. BS.

So you'd like to see a double opt-in? Unforunately there is no gating
of the bug and I tried to make it generic - clearly calling it opt-in
flushing for the paranoid, for those who really care about CVE-2020-0550.

> 
>  (4) what does L1D_FLUSH_POPULATE_TLB mean?
> 
>    That "option" makes zero sense. It pre-populates the TLB before
> doing the accesses to the L1D$ pages in the SW case, but nothing at
> all explains why that is needed. It's clearly not needed for the
> caller, since the TLB population only happens for the SW fallback, not
> for the HW one.

Good question, I asked around in the RFC and in my email threads as
to why we needed this even for the L1TF case with no response. I decided
not to do it, unless I fully understood why it's needed.

I am happy to remove the SW fallback if needed.

> 
>    No documentation, no nothing. It's enabled for the VMX case, not
> for the non-vmx case, which makes me suspect it's some crazy "work
> around vm monitor page faults, because we know our SW flush fallback
> is just random garbage".
>

Would this make you happier?

1. Remove SW fallback flush
2. Implement a double opt-in (CAP_SYS_ADMIN for the prctl or a
   system wide disable)?
3. Ensure the flush happens only when the current core has
   SMT disabled

w.r.t. switch_mm() vs another place to flush, it is a trade-off,
would 1 to 3 convince you?

In summary there was a discussion, two RFCs and the patches were reviewed.
You'd like to see further filters to the flush, which is covered in points
1 to 3 above.

Balbir Singh.