linux-kernel.vger.kernel.org archive mirror
From: Thomas Gleixner <tglx@linutronix.de>
To: "Singh, Balbir" <sblbir@amazon.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: "keescook@chromium.org" <keescook@chromium.org>, "Herrenschmidt,
	Benjamin" <benh@amazon.com>, "x86@kernel.org" <x86@kernel.org>
Subject: Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
Date: Sat, 21 Mar 2020 11:05:32 +0100	[thread overview]
Message-ID: <87d096rpjn.fsf@nanos.tec.linutronix.de> (raw)
In-Reply-To: <034a2c0e2cc1bb0f4f7ff9a2c5cbdc269a483a71.camel@amazon.com>

Balbir,

"Singh, Balbir" <sblbir@amazon.com> writes:
> On Fri, 2020-03-20 at 12:49 +0100, Thomas Gleixner wrote:
>> I forgot the gory details by now, but having two entry points or a
>> conditional and sharing the rest (page allocation etc.) is definitely
>> better than two slightly different implementations that basically do
>> the same thing.
>
> OK, I can try to dedup them to the extent possible, but please do
> remember that
>
> 1. KVM is usually loaded as a module
> 2. KVM is optional
>
> We can share code, by putting the common bits in the core kernel.

Obviously so.

>> > 1. SWAPGS fixes/work arounds (unless I misunderstood your suggestion)
>> 
>> How so? SWAPGS mitigation does not flush L1D. It merely serializes SWAPGS.
>
> Sorry, my bad, I was thinking of MDS_CLEAR (via VERW), which does flush
> out things and which I suspect should be sufficient from a
> return-to-user/signal-handling perspective.

MDS affects store buffers, fill buffers and load ports. Different story.

> Right now, reading through
> https://software.intel.com/security-software-guidance/insights/deep-dive-snoop-assisted-l1-data-sampling,
> it does seem like we need this during a context switch, specifically
> since a dirty cache line can cause snooped reads that let the attacker
> leak data. Am I missing anything?

Yes. The way this goes is:

CPU0                   CPU1

victim1
 store secret
                       victim2
attacker                read secret

Now if L1D is flushed on CPU0 before the attacker reaches user space,
i.e. reaches the attack code, then there is nothing to see. From the
link:

  Similar to the L1TF VMM mitigations, snoop-assisted L1D sampling can be
  mitigated by flushing the L1D cache between when secrets are accessed
  and when possibly malicious software runs on the same core.

So the important point is to flush _before_ the attack code runs which
involves going back to user space or guest mode.
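The timing constraint here can be reduced to a simple predicate: a flush
only helps if it happens after the victim's stores and before a
potentially hostile task can execute attack code, i.e. at the transition
to user or guest mode. A minimal model of that decision (the enum and
helper names are illustrative, not kernel API):

```c
#include <stdbool.h>

/* Where is the CPU about to resume execution?  Attack code can only run
 * in user or guest mode; kernel code is not attacker-controlled. */
enum exit_target { TO_KERNEL, TO_USER, TO_GUEST };

/* Flush L1D only when leaving the kernel after a task that touched
 * secrets.  Flushing earlier, e.g. at a context switch into a task that
 * stays in the kernel, buys nothing: the attack code cannot run yet. */
static bool l1d_flush_needed(bool prev_touched_secrets, enum exit_target t)
{
    return prev_touched_secrets && (t == TO_USER || t == TO_GUEST);
}
```

This is why the flush belongs on the kernel-exit path rather than in the
context switch itself.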

>> Even this is uninteresting:
>> 
>>     victim in -> attacker in (stays in kernel, e.g. waits for data) ->
>>     attacker out -> victim in
>> 
>
> Not from what I understand from the link above: the attack is a
> function of what can be snooped by another core/thread, and that is a
> function of what modified secrets are in the cache line/store buffer.

Forget HT. That's not fixable by any flushing simply because there is no
scheduling involved.

CPU0 HT0           CPU0 HT1            CPU1

victim1            attacker
 store secret
                                       victim2
                                        read secret

> On return to user, we already use VERW, but just return-to-user
> protection is not sufficient IMHO. Based on the link above, we need to
> clear the L1D cache before it can be snooped.

Again: a flush is required between the store and the attacker running
attack code. The attacker _cannot_ run attack code while it is in the
kernel, so flushing L1D on context switch is just voodoo.
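For reference, the flush itself is a single MSR write on CPUs that
advertise hardware L1D flush; otherwise the kernel falls back to reading
and writing a buffer larger than L1D to displace its contents. A hedged
userspace model of the fallback idea only (sizes are illustrative; the
real kernel uses dedicated flush pages and the IA32_FLUSH_CMD MSR when
available):

```c
#include <stddef.h>
#include <stdint.h>

#define L1D_SIZE    (32 * 1024)      /* typical L1D size; illustrative  */
#define FLUSH_BYTES (2 * L1D_SIZE)   /* overfill to displace every line */

/* Model of the software fallback: touching a buffer larger than L1D
 * displaces whatever the previous task left in the cache.  This model
 * returns the sum of the bytes it read so the loop has an observable
 * effect; the real code cares only about the cache side effect. */
static uint64_t l1d_flush_sw(const volatile uint8_t *buf)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < FLUSH_BYTES; i += 64)  /* one read per 64-byte line */
        sum += buf[i];
    return sum;
}
```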

If you want to cure the HT case with core scheduling then the scenario
looks like this:

CPU0 HT0           CPU0 HT1            CPU1

victim1            IDLE
 store secret
-> IDLE
                   attacker in         victim2
                                        read secret

And yes, there the context-switch flush on HT0 prevents it. So this can
be part of a core-scheduling-based mitigation or handled via a per-core
flush request.
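One way such a per-core flush request could look is a flag per physical
core that a victim sets and the next kernel exit on either sibling
consumes. A hypothetical sketch (all names made up; real core scheduling
is far more involved):

```c
#include <stdbool.h>

#define NR_CORES 4

/* Per physical core, not per logical CPU: both HT siblings share L1D,
 * so a secret stored by either sibling taints the whole core. */
static bool core_l1d_dirty[NR_CORES];

/* Called when a victim task stores a secret on this core. */
static void core_mark_dirty(int core)
{
    core_l1d_dirty[core] = true;
}

/* Called before any task on this core drops to user/guest mode: report
 * whether a flush is due and clear the request, so the flush happens
 * once rather than on every sibling's exit. */
static bool core_consume_flush(int core)
{
    if (!core_l1d_dirty[core])
        return false;
    core_l1d_dirty[core] = false;
    return true;               /* caller performs the actual L1D flush */
}
```

The point of clearing the flag on first consumption is that one flush
per dirty period suffices; repeated flushes on the same core are wasted
work.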

But HT is attackable in so many ways ...

Thanks,

        tglx


Thread overview: 15+ messages
2020-03-13 22:04 [RFC PATCH] arch/x86: Optionally flush L1D on context switch Balbir Singh
2020-03-18 23:14 ` Kees Cook
2020-03-20  1:35   ` Singh, Balbir
2020-03-19  0:38 ` Thomas Gleixner
2020-03-20  1:37   ` Singh, Balbir
2020-03-20 11:49     ` Thomas Gleixner
2020-03-21  1:42       ` Singh, Balbir
2020-03-21 10:05         ` Thomas Gleixner [this message]
2020-03-22  5:10           ` Herrenschmidt, Benjamin
2020-03-23  0:37           ` Singh, Balbir
2020-03-22  5:08       ` Herrenschmidt, Benjamin
2020-03-22 15:10         ` Andy Lutomirski
2020-03-22 23:17           ` Herrenschmidt, Benjamin
2020-03-23  0:12           ` Singh, Balbir
2020-03-22  5:01   ` Herrenschmidt, Benjamin
