LKML Archive on lore.kernel.org
* [GIT PULL] x86/mm changes for v5.8
@ 2020-06-01 17:01 Ingo Molnar
  2020-06-01 21:42 ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2020-06-01 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Thomas Gleixner, Borislav Petkov, Peter Zijlstra,
	Andrew Morton, Andy Lutomirski

Linus,

Please pull the latest x86/mm git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-mm-2020-06-01

   # HEAD: 0fcfdf55db9e1ecf85edd6aa8d0bc78a448cb96a Documentation: Add L1D flushing Documentation

Misc changes:

 - Unexport various PAT primitives

 - Unexport per-CPU tlbstate

 - Provide an opt-in (prctl driven) mechanism to flush the L1D cache on context switch.
   The goal is to allow tasks that are paranoid due to the recent snoop assisted data
   sampling vulnerabilities, to flush their L1D on being switched out.
   This protects their data from being snooped or leaked via side channels
   after the task has context switched out.

Signed-off-by: Ingo Molnar <mingo@kernel.org>

  out-of-topic modifications in x86/mm:
  ---------------------------------------
  include/uapi/linux/prctl.h         # edf7ce0b231c: prctl: Hook L1D flushing in 

 Thanks,

	Ingo

------------------>
Balbir Singh (7):
      x86/kvm: Refactor L1D flush page management
      x86/kvm: Refactor L1D flush operations
      x86/mm: Refactor cond_ibpb() to support other use cases
      x86/kvm: Refactor L1D flushing
      x86/mm: Optionally flush L1D on context switch
      prctl: Hook L1D flushing in via prctl
      Documentation: Add L1D flushing Documentation

Borislav Petkov (1):
      x86/tlb/uv: Add a forward declaration for struct flush_tlb_info

Christoph Hellwig (5):
      x86/mm: Add a x86_has_pat_wp() helper
      x86/mm: Move pgprot2cachemode out of line
      x86/mm: Cleanup pgprot_4k_2_large() and pgprot_large_2_4k()
      x86/mm: Unexport __cachemode2pte_tbl
      x86/mm: Use pgprotval_t in protval_4k_2_large() and protval_large_2_4k()

Thomas Gleixner (17):
      x86/tlb: Uninline __get_current_cr3_fast()
      x86/cpu: Uninline CR4 accessors
      x86/cr4: Sanitize CR4.PCE update
      x86/alternatives: Move temporary_mm helpers into C
      x86/tlb: Move __flush_tlb() out of line
      x86/tlb: Move __flush_tlb_global() out of line
      x86/tlb: Move __flush_tlb_one_user() out of line
      x86/tlb: Move __flush_tlb_one_kernel() out of line
      x86/tlb: Move flush_tlb_others() out of line
      x86/tlb: Move __flush_tlb_all() out of line
      x86/tlb: Move paravirt_tlb_remove_table() to the usage site
      x86/tlb: Move cr4_set_bits_and_update_boot() to the usage site
      x86/tlb: Uninline nmi_uaccess_okay()
      x86/tlb: Move PCID helpers where they are used
      xen/privcmd: Remove unneeded asm/tlb.h include
      x86/tlb: Restrict access to tlbstate
      x86/cpu: Export native_write_cr4() only when CONFIG_LKTDM=m


 Documentation/admin-guide/hw-vuln/index.rst     |   1 +
 Documentation/admin-guide/hw-vuln/l1d_flush.rst |  51 +++
 Documentation/userspace-api/spec_ctrl.rst       |   7 +
 arch/x86/events/core.c                          |  11 +-
 arch/x86/include/asm/cacheflush.h               |   8 +
 arch/x86/include/asm/memtype.h                  |   3 +
 arch/x86/include/asm/mmu_context.h              |  88 +----
 arch/x86/include/asm/paravirt.h                 |  12 +-
 arch/x86/include/asm/pgtable_32.h               |   2 +-
 arch/x86/include/asm/pgtable_types.h            |  44 +--
 arch/x86/include/asm/thread_info.h              |   9 +-
 arch/x86/include/asm/tlbflush.h                 | 443 +++-------------------
 arch/x86/include/asm/uv/uv.h                    |   1 +
 arch/x86/kernel/Makefile                        |   1 +
 arch/x86/kernel/alternative.c                   |  55 +++
 arch/x86/kernel/cpu/bugs.c                      |  28 ++
 arch/x86/kernel/cpu/common.c                    |  25 +-
 arch/x86/kernel/cpu/mtrr/generic.c              |   4 +-
 arch/x86/kernel/l1d_flush.c                     | 120 ++++++
 arch/x86/kernel/paravirt.c                      |  21 +-
 arch/x86/kernel/process.c                       |  11 +
 arch/x86/kvm/vmx/vmx.c                          |  62 +---
 arch/x86/mm/init.c                              |  44 ++-
 arch/x86/mm/init_64.c                           |   4 +-
 arch/x86/mm/ioremap.c                           |  10 +-
 arch/x86/mm/kmmio.c                             |   2 +-
 arch/x86/mm/mem_encrypt.c                       |   2 +-
 arch/x86/mm/pat/set_memory.c                    |   7 +-
 arch/x86/mm/pgtable.c                           |  16 +-
 arch/x86/mm/pgtable_32.c                        |   2 +-
 arch/x86/mm/tlb.c                               | 471 ++++++++++++++++++++++--
 arch/x86/platform/uv/tlb_uv.c                   |   4 +-
 drivers/xen/privcmd.c                           |   1 -
 include/uapi/linux/prctl.h                      |   1 +
 34 files changed, 899 insertions(+), 672 deletions(-)
 create mode 100644 Documentation/admin-guide/hw-vuln/l1d_flush.rst
 create mode 100644 arch/x86/kernel/l1d_flush.c

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] x86/mm changes for v5.8
  2020-06-01 17:01 [GIT PULL] x86/mm changes for v5.8 Ingo Molnar
@ 2020-06-01 21:42 ` Linus Torvalds
  2020-06-02  2:35   ` Linus Torvalds
  2020-06-02  7:33   ` Ingo Molnar
  0 siblings, 2 replies; 15+ messages in thread
From: Linus Torvalds @ 2020-06-01 21:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Borislav Petkov,
	Peter Zijlstra, Andrew Morton, Andy Lutomirski

On Mon, Jun 1, 2020 at 10:01 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>  - Provide an opt-in (prctl driven) mechanism to flush the L1D cache on context switch.
>    The goal is to allow tasks that are paranoid due to the recent snoop assisted data
>    sampling vulnerabilities, to flush their L1D on being switched out.

Am I mis-reading this?

Because it looks to me like this basically exports cache flushing
instructions to user space, and gives processes a way to just say
"slow down anybody else I schedule with too".

I don't see a way for a system admin to say "this is stupid, don't do it".

In other words, from what I can tell, this takes the crazy "Intel
ships buggy CPU's and it causes problems for virtualization" code
(which I didn't much care about), and turns it into "anybody can opt
in to this disease, and now it affects even people and CPU's that
don't need it and configurations where it's completely pointless".

To make matters worse, it has that SW flushing fallback that isn't
even architectural from what I remember of the last time it was
discussed, but most certainly will waste a lot of time going through
the motions that may or may not flush the L1D after all.

I don't want some application to go "Oh, I'm _soo_ special and pretty
and such a delicate flower, that I want to flush the L1D on every task
switch, regardless of what CPU I am on, and regardless of whether
there are errata or not".

Because that app isn't just slowing down itself, it's slowing down others too.

I have a hard time following whether this might all end up being
predicated on the STIBP static branch conditionals and might thus at
least be limited only to CPU's that have the problem in the first
place.

But I ended up unpulling it because I can't figure that out, and the
explanations in the commits don't clarify (and do imply that it's
regardless of any other errata, since it's for "undiscovered future
errata").

Because I don't want a random "I can make the kernel do stupid things"
flag for people to opt into. I think it needs a double opt-in.

At a _minimum_, SMT being enabled should disable this kind of crazy
pseudo-security entirely, since it is completely pointless in that
situation. Scheduling simply isn't a synchronization point with SMT
on, so saying "sure, I'll flush the L1 at context switch" is beyond
stupid.

I do not want the kernel to do things that seem to be "beyond stupid".

Because I really think this is just PR and pseudo-security, and I
think there's a real cost in making people think "oh, I'm so special
that I should enable this".

I'm more than happy to be educated on why I'm wrong, but for now I'm
unpulling it for lack of data.

Maybe it never happens on SMT because of all those subtle static
branch rules, but I'd really like that to be explained.

                    Linus

* Re: [GIT PULL] x86/mm changes for v5.8
  2020-06-01 21:42 ` Linus Torvalds
@ 2020-06-02  2:35   ` Linus Torvalds
  2020-06-02 10:25     ` Singh, Balbir
  2020-06-02  7:33   ` Ingo Molnar
  1 sibling, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2020-06-02  2:35 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, Andrew Lutomirski, Thomas Gleixner,
	Benjamin Herrenschmidt

On Mon, Jun 1, 2020 at 6:06 PM Balbir Singh <sblbir@amazon.com> wrote:
>
> I think apps can do this independently today as in do the flush
> via software fallback in the app themselves.

Sure, but they can't force the kernel to do crazy things for every task switch.

> But I see your concern with
> respect to an overall admin override, I thought about it, but I stopped
> because of the reason above.

So my real concern is that I had a hard time following all the logic,
but it _looked_ to me like it ends up doing the flushes even if SMT is
enabled, for example.

And that's not an "admin override". That's just a full no-no. If SMT
is on, then we don't do any L1D$ flushing in the kernel, because it
would be entirely pointless.

But this all quite possibly interacts with all the subtle static
branches we have in this same area, so ...

> The software fallback was there and is the same algorithm that we use
> for L1TF. We could potentially remove the software fallback if there
> is a concern, having the software fallback provides a mechanism
> independent of the microcode version similar to L1TF.

So the SW fallback raises my hackles, because we had the exact same
kind of non-architected software fallback for clearing the data
buffers for MDS.

And that one turned out to be not only incredibly expensive, but it
didn't work reliably anyway, and was really only written for one
microarchitecture.

It's not clear that a (generic) SW routine _can_ reliably flush the
L1D$ simply because of replacement policy issues and possibly D$
cacheline management settings.

Again, I'm sure that for any particular microarchitecture, writing a
SW fallback is possible.

And I suspect that for most of them, "fill enough of the cache" ends
up making things very hard to trigger in practice, but at a very high
cost.

> >I have a hard time following whether this might all end up being
> >predicated on the STIBP static branch conditionals and might thus at
> >least be limited only to CPU's that have the problem in the first
> >place.
>
> No, this is at the moment restricted to just Intel CPUs and it is designed
> as a generic fallback mechanism for issues involving speculation and L1D
> flush for example CVE-2020-0550[1]. This mechanism addresses issues beyond
> what STIBP addresses.

Ok, so that was hard to see.

What is it that disables it exactly? I'm ok with the concept, but it
needs to be clearly targeted, and the target wasn't clear to me. In
fact, everything in the docs and changelog implied that the target was
"

> >Because I don't want a random "I can make the kernel do stupid things"
> >flag for people to opt into. I think it needs a double opt-in.
> >
>
> Are you happy for this controlled by CAP_SYS_ADMIN or are you suggesting
> a sysfs override by the administrator to completely disable this?

At a _minimum_, I want to make sure this code never possibly ever
starts flushing anything if SMT is on.

That may be the case already. The narrow twisty mazes here from
"enable this" to "do the flush" were too hard to follow for me, which
is why I'm asking for more clarification.

And yes, I did read the documentation you point to. That did not
clarify anything at all.

So to put this in very clear terms:

 - what happens if SMT is on, and a process does
"prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT, PR_SPEC_ENABLE,
0, 0);"

Because if the answer is "the kernel starts flushing the L1D at
context switches" then I refuse to pull this.

Why? Because it's just an incredibly stupid waste of time and effort
to do that, and I can see some poor hapless ssh developer saying "yes,
I should enable this thing because ssh is very special", and then ssh
just starts wasting time on something that doesn't actually help.

See what I'm saying? I want to know this feature isn't doing crazy
things. And "flush L1D$ on SMT" is crazy, since an attacker would just
sit on a sibling core and attack the L1 contents *before* the task
switch happens.

I have some other questions about this approach in the first place. I
don't see why context switch is even relevant, and why it should be
the place we flush. The kernel is trustworthy in this situation, both
before and after the context switch. So context switch makes no
difference what-so-ever from a security domain transfer angle.

Also, shouldn't we avoid flushing if you just run as the same user all
the time? IOW, context switch in itself isn't really relevant as a
security domain transfer, but it *is* relevant in the sense that
switching from one user to another is a sign of "uhhuh, now maybe I
should be careful when returning to user mode".

IOW, think of a "pipe ping-pong" test program. Set the flag for "I
want L1D$ cache flushing". Run the program with nothing else
happening, and a _good_ implementation should never ever cache-flush,
because at no point did we ever enter untrusted space: we were either
in the kernel (not just for the system calls, but for idle threads),
or we were in user context that had the right to see the data anyway.

So despite asking for a L1D$ flush on context changes, such a
ping-pong test program that does millions of context switches per
second shouldn't actually ever cause a cache flush, because there was
never any point.

IOW, what mitigations are in place for this not doing unnecessary
cache flushes, either because they are fundamentally pointless (the
machine has SMT enabled) or because they just aren't necessary (no
transition to an untrusted security domain has happened)?

And maybe those mitigations are actually there, and I just couldn't
figure it out. Some of the code scares me ("cond_ibpb()" and friends,
even if you did rename it to "cond_mitigation()").

                   Linus

* Re: [GIT PULL] x86/mm changes for v5.8
  2020-06-01 21:42 ` Linus Torvalds
  2020-06-02  2:35   ` Linus Torvalds
@ 2020-06-02  7:33   ` Ingo Molnar
  2020-06-02  9:37     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2020-06-02  7:33 UTC (permalink / raw)
  To: Balbir Singh
  Cc: torvalds, a.p.zijlstra, akpm, bp, linux-kernel, luto, tglx, benh


* Balbir Singh <sblbir@amazon.com> wrote:

> > At a _minimum_, SMT being enabled should disable this kind of crazy
> > pseudo-security entirely, since it is completely pointless in that
> > situation. Scheduling simply isn't a synchronization point with SMT
> > on, so saying "sure, I'll flush the L1 at context switch" is beyond
> > stupid.
> >
> > I do not want the kernel to do things that seem to be "beyond stupid".
> >
> > Because I really think this is just PR and pseudo-security, and I
> > think there's a real cost in making people think "oh, I'm so special
> > that I should enable this".
> >
> > I'm more than happy to be educated on why I'm wrong, but for now I'm
> > unpulling it for lack of data.
> >
> > Maybe it never happens on SMT because of all those subtle static
> > branch rules, but I'd really like that to be explained.
> 
> The documentation calls out the SMT restrictions.

That's not what Linus suggested though, and you didn't answer his 
concerns AFAICS.

The documentation commit merely mentions that this feature is useless 
with SMT:

  0fcfdf55db9e: ("Documentation: Add L1D flushing Documentation")

+Limitations
+-----------
+
+The mechanism does not mitigate L1D data leaks between tasks belonging to
+different processes which are concurrently executing on sibling threads of
+a physical CPU core when SMT is enabled on the system.
+
+This can be addressed by controlled placement of processes on physical CPU
+cores or by disabling SMT. See the relevant chapter in the L1TF mitigation
+document: :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.

Linus is right that the proper response is for this feature to do 
*nothing* if SMT is enabled on affected CPUs - but that's not 
implemented in the patches ...

Or rather, we should ask a higher level question as well, maybe we 
should not do this feature at all?

Typically cloud computing systems such as AWS will have SMT enabled, 
because cloud computing pricing is essentially per vCPU, and they want 
to sell the hyperthreads as vCPUs. So the safest solution, disabling 
SMT on affected systems, is not actually done, because it's an 
economic non-starter. (I'd like to note the security double standard 
there: the most secure option, to disable SMT, is not actually used ...)

BTW., I wonder how Amazon is solving the single-vCPU customer workload 
problem on AWS: if the vast majority of AWS computing capacity is 
running on a single vCPU, because it's the cheapest tier and because 
it's more than enough capacity to run a website. Even core-scheduling 
doesn't solve this fundamental SMT security problem: separate customer 
workloads *cannot* share the same core - but this means that the 
single-vCPU workloads will only be able to utilize 50% of all 
available vCPUs if they are properly isolated.

Or if the majority of AWS EC2 etc. customer systems are using 2, 4 or 
more vCPUs, then both this feature and 'core-scheduling' is 
effectively pointless from a security POV, because the cloud computing 
systems are de-facto partitioned into cores already, with each core 
accounted as 2 vCPUs.

The hour-up-rounded way AWS (and many other cloud providers) account 
system runtime costs suggests that they are doing relatively static 
partitioning of customer workloads already, i.e. customer workloads 
are mapped to actual physical hardware in an exclusive fashion, with 
no overcommitting of physical resources and no sharing of cores 
between customers.

If I look at the pricing and capabilities table of AWS:

  https://aws.amazon.com/ec2/pricing/on-demand/

Only the 't2' and 't3' On-Demand instances have 'Variable' pricing, 
which is only 9% of the offered 228 configurations.

I.e. I strongly suspect that neither L1D flushing nor core-scheduling 
is actually required on affected vulnerable CPUs to keep customer 
workloads isolated from each other, on the majority of cloud computing 
systems, because they are already isolated via semi-static 
partitioning, using pricing that reflects static partitioning.

Thanks,

	Ingo

* Re: [GIT PULL] x86/mm changes for v5.8
  2020-06-02  7:33   ` Ingo Molnar
@ 2020-06-02  9:37     ` Benjamin Herrenschmidt
  2020-06-02 18:28       ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2020-06-02  9:37 UTC (permalink / raw)
  To: Ingo Molnar, Balbir Singh
  Cc: torvalds, a.p.zijlstra, akpm, bp, linux-kernel, luto, tglx

On Tue, 2020-06-02 at 09:33 +0200, Ingo Molnar wrote:
> Or rather, we should ask a higher level question as well, maybe we 
> should not do this feature at all?

Well, it does solve a real issue in some circumstances and there was a
reasonable discussion about this on the list that lead to it being
merged with Kees and Thomas (and others) agreeing :)

But yes, it is pointless with SMT and yes maybe we should make it
explicitly do nothing on SMT, but let's not throw the baby out with the
bath water shall we ?

> Typically cloud computing systems such as AWS will have SMT enabled, 
> because cloud computing pricing is essentially per vCPU, and they want 
> to sell the hyperthreads as vCPUs.

Not necessarily and not in all circumstances. Yes, VMs will typically
have SMT enabled. This isn't targeted at them though. One example that
was given during the discussions was containers pertaining to different
users. Now maybe we could discuss making the flush on changes of
cgroups or other similar things rather than individual processes, but so
far that hasn't come up during the discussion on the mailing list.

Another example would be a process that handles more critical data,
such as payment information, than the rest of the system and wants to
protect itself (or the admin wants that process protected) against
possible data leaks to less trusted processes.

>  So the safest solution, disabling 
> SMT on affected systems, is not actually done, because it's an 
> economic non-starter. (I'd like to note the security double standard 
> there: the most secure option, to disable SMT, is not actually used ...)

This has nothing to do with SMT, though yes maybe we should make the
patch do nothing on SMT, but this isn't what this feature is about.

> BTW., I wonder how Amazon is solving the single-vCPU customer workload 
> problem on AWS: if the vast majority of AWS computing capacity is 
> running on a single vCPU, because it's the cheapest tier and because 
> it's more than enough capacity to run a website. Even core-scheduling 
> doesn't solve this fundamental SMT security problem: separate customer 
> workloads *cannot* share the same core - but this means that the 
> single-vCPU workloads will only be able to utilize 50% of all 
> available vCPUs if they are properly isolated.
> 
> Or if the majority of AWS EC2 etc. customer systems are using 2, 4 or 
> more vCPUs, then both this feature and 'core-scheduling' is 
> effectively pointless from a security POV, because the cloud computing 
> systems are de-facto partitioned into cores already, with each core 
> accounted as 2 vCPUs.

AWS has more than just VMs for rent :-) There are a whole pile of
higher level "services" that our users can use and not all of them
necessarily run on VMs charged per vCPU.

> The hour-up-rounded way AWS (and many other cloud providers) account 
> system runtime costs suggests that they are doing relatively static 
> partitioning of customer workloads already, i.e. customer workloads 
> are mapped to actual physical hardware in an exclusive fashion, with 
> no overcommitting of physical resources and no sharing of cores 
> between customers.
> 
> If I look at the pricing and capabilities table of AWS:
> 
>   https://aws.amazon.com/ec2/pricing/on-demand/
> 
> Only the 't2' and 't3' On-Demand instances have 'Variable' pricing, 
> which is only 9% of the offered 228 configurations.
> 
> I.e. I strongly suspect that neither L1D flushing nor core-scheduling 
> is actually required on affected vulnerable CPUs to keep customer 
> workloads isolated from each other, on the majority of cloud computing 
> systems, because they are already isolated via semi-static 
> partitioning, using pricing that reflects static partitioning.

This isn't about that. These patches aren't trying to solve problems
happening inside of a customer VM running SMT, nor are they about
protecting VMs against other VMs on the same system.

Cheers,
Ben.


* Re: [GIT PULL] x86/mm changes for v5.8
  2020-06-02  2:35   ` Linus Torvalds
@ 2020-06-02 10:25     ` Singh, Balbir
  0 siblings, 0 replies; 15+ messages in thread
From: Singh, Balbir @ 2020-06-02 10:25 UTC (permalink / raw)
  To: torvalds; +Cc: mingo, linux-kernel, a.p.zijlstra, tglx, akpm, luto, bp, benh

On Mon, 2020-06-01 at 19:35 -0700, Linus Torvalds wrote:
> 
> On Mon, Jun 1, 2020 at 6:06 PM Balbir Singh <sblbir@amazon.com> wrote:
> > 
> > I think apps can do this independently today as in do the flush
> > via software fallback in the app themselves.
> 
> Sure, but they can't force the kernel to do crazy things for every task switch.

On every mm switch, yes :) 

> 
> > But I see your concern with
> > respect to an overall admin override, I thought about it, but I stopped
> > because of the reason above.
> 
> So my real concern is that I had a hard time following all the logic,
> but it _looked_ to me like it ends up doing the flushes even if SMT is
> enabled, for example.
> 
> And that's not an "admin override". That's just a full no-no. If SMT
> is on, then we don't do any L1D$ flushing in the kernel, because it
> would be entirely pointless.

Just to clarify: SMT as in SMT on the current core, not the entire system.

> 
> But this all quite possibly interacts with all the subtle static
> branches we have in this same area, so ...
> 
> > The software fallback was there and is the same algorithm that we use
> > for L1TF. We could potentially remove the software fallback if there
> > is a concern, having the software fallback provides a mechanism
> > independent of the microcode version similar to L1TF.
> 
> So the SW fallback raises my hackles, because we had the exact same
> kind of non-architected software fallback for clearing the data
> buffers for MDS.
> 
> And that one turned out to be not only incredibly expensive, but it
> didn't work reliably anyway, and was really only written for one
> microarchitecture.
> 
> It's not clear that a (generic) SW routine _can_ reliably flush the
> L1D$ simply because of replacement policy issues and possibly D$
> cacheline management settings.
> 
> Again, I'm sure that for any particular microarchitecture, writing a
> SW fallback is possible.
> 
> And I suspect that for most of them, "fill enough of the cache" ends
> up making things very hard to trigger in practice, but at a very high
> cost.

Yes, I agree with that; we could turn off the fallback if that helps,
and we can have this feature enabled only when we have the MSR to do the
L1D flush.

> 
> > > I have a hard time following whether this might all end up being
> > > predicated on the STIBP static branch conditionals and might thus at
> > > least be limited only to CPU's that have the problem in the first
> > > place.
> > 
> > No, this is at the moment restricted to just Intel CPUs and it is designed
> > as a generic fallback mechanism for issues involving speculation and L1D
> > flush for example CVE-2020-0550[1]. This mechanism addresses issues beyond
> > what STIBP addresses.
> 
> Ok, so that was hard to see.
> 
> What is it that disables it exactly? I'm ok with the concept, but it
> needs to be clearly targeted, and the target wasn't clear to me. In
> fact, everything in the docs and changelog implied that the target was
> "

It addresses issues associated with "Snoop-Assisted L1D Sampling". Is that
what you referred to as target?

> 
> > > Because I don't want a random "I can make the kernel do stupid things"
> > > flag for people to opt into. I think it needs a double opt-in.
> > > 
> > 
> > Are you happy for this controlled by CAP_SYS_ADMIN or are you suggesting
> > a sysfs override by the administrator to completely disable this?
> 
> At a _minimum_, I want to make sure this code never possibly ever
> starts flushing anything if SMT is on.
> 
> That may be the case already. The narrow twisty mazes here from
> "enable this" to "do the flush" were too hard to follow for me, which
> is why I'm asking for more clarification.
> 
> And yes, I did read the documentation you point to. That did not
> clarify anything at all.
> 
> So to put this in very clear terms:
> 
>  - what happens if SMT is on, and a process does
> "prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH_OUT, PR_SPEC_ENABLE,
> 0, 0);"
> 
> Because if the answer is "the kernel starts flushing the L1D at
> context switches" then I refuse to pull this.
> 
> Why? Because it's just an incredibly stupid waste of time and effort
> to do that, and I can see some poor hapless ssh developer saying "yes,
> I should enable this thing because ssh is very special", and then ssh
> just starts wasting time on something that doesn't actually help.
> 
> See what I'm saying? I want to know this feature isn't doing crazy
> things. And "flush L1D$ on SMT" is crazy, since an attacker would just
> sit on a sibling core and attack the L1 contents *before* the task
> switch happens.
> 
> I have some other questions about this approach in the first place. I
> don't see why context switch is even relevant, and why it should be
> the place we flush. The kernel is trustworthy in this situation, both
> before and after the context switch. So context switch makes no
> difference what-so-ever from a security domain transfer angle.
> 
> Also, shouldn't we avoid flushing if you just run as the same user all
> the time? IOW, context switch in itself isn't really relevant as a
> security domain transfer, but it *is* relevant in the sense that
> switching from one user to another is a sign of "uhhuh, now maybe I
> should be careful when returning to user mode".
>
 
The example you list above will cause flushing only if the mm context
switches on the core. I've tested the implementation with such a program
and seen flushes only when another task (different mm) is on the same
CPU as the one with the program that needs to flush. Normally, on an idle
system where nothing else with a different mm gets scheduled on the same
core as the program wanting to flush L1D, I see no flushes.

On the user front, I am not so sure. A user can host multiple tasks, and
if one of them were compromised, it would be bad to allow the leak
to happen. For example, if a plugin in a browser could leak a security
key of a secure session, that would be bad.

> IOW, think of a "pipe ping-pong" test program. Set the flag for "I
> want L1D$ cache flushing". Run the program with nothing else
> happening, and a _good_ implementation should never ever cache-flush,
> because at no point did we ever enter untrusted space: we were either
> in the kernel (not just for the system calls, but for idle threads),
> or we were in user context that had the right to see the data anyway.
> 
> So despite asking for a L1D$ flush on context changes, such a
> ping-pong test program that does millions of context switches per
> second shouldn't actually ever cause a cache flush, because there was
> never any point.
> 
> IOW, what mitigations are in place for this not doing unnecessary
> cache flushes, either because they are fundamentally pointless (the
> machine has SMT enabled) or because they just aren't necessary (no
> transition to an untrusted security domain has happened)?

We tried to bring them down quite a bit by applying the flush only on
mm switches; we could bring it down further by applying an SMT filter for
the core (flushing only when the core is running single-threaded).

> 
> And maybe those mitigations are actually there, and I just couldn't
> figure it out. Some of the code scares me ("cond_ibpb()" and friends,
> even if you did rename it to "cond_mitigation()").
> 

I reused some of those bits from earlier code, largely because of the overlap
with the use case of flushing on mm context change.

It sounds like changing the flush as follows would address your concern:
avoid the L1D flush if the current core is in SMT mode.

For user space considerations prior to flushing, I provided some arguments
above.

Balbir Singh.



* Re: [GIT PULL] x86/mm changes for v5.8
  2020-06-02  9:37     ` Benjamin Herrenschmidt
@ 2020-06-02 18:28       ` Thomas Gleixner
  2020-06-02 19:14         ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2020-06-02 18:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Ingo Molnar, Balbir Singh
  Cc: torvalds, a.p.zijlstra, akpm, bp, linux-kernel, luto

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
> On Tue, 2020-06-02 at 09:33 +0200, Ingo Molnar wrote:
>> Or rather, we should ask a higher level question as well, maybe we 
>> should not do this feature at all?
>
> Well, it does solve a real issue in some circumstances and there was a
> reasonable discussion about this on the list that lead to it being
> merged with Kees and Thomas (and others) agreeing :)
>
> But yes, it is pointless with SMT and yes maybe we should make it
> explicitly do nothing on SMT, but let's not throw the baby out with the
> bath water shall we ?

It's trivial enough to fix. We have a static key already which is
telling us whether SMT scheduling is active.

Thanks,

        tglx
---
 arch/x86/mm/tlb.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,6 +8,7 @@
 #include <linux/export.h>
 #include <linux/cpu.h>
 #include <linux/debugfs.h>
+#include <linux/sched/smt.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -457,7 +458,7 @@ static void cond_mitigation(struct task_
 			indirect_branch_prediction_barrier();
 	}
 
-	if (prev_mm & LAST_USER_MM_L1D_FLUSH) {
+	if (!sched_smt_active() && prev_mm & LAST_USER_MM_L1D_FLUSH) {
 		/*
 		 * Don't populate the TLB for the software fallback flush.
 		 * Populate TLB is not needed for this use case.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] x86/mm changes for v5.8
  2020-06-02 18:28       ` Thomas Gleixner
@ 2020-06-02 19:14         ` Linus Torvalds
  2020-06-02 23:01           ` Singh, Balbir
  2020-06-04 17:29           ` [GIT PULL v2] " Ingo Molnar
  0 siblings, 2 replies; 15+ messages in thread
From: Linus Torvalds @ 2020-06-02 19:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Benjamin Herrenschmidt, Ingo Molnar, Balbir Singh,
	Peter Zijlstra, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, Andrew Lutomirski

On Tue, Jun 2, 2020 at 11:29 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> It's trivial enough to fix. We have a static key already which is
> telling us whether SMT scheduling is active.

.. but should we do it here, in switch_mm() in the first place?

Should it perhaps just return an error if user land tries to set the
"flush L1$" thing on an SMT system? And no, I don't think we care at
all if people then start playing games and enabling/disabling SMT
dynamically while applications are working. At that point the admin
gets to keep both of the broken pieces.

Also, see my other point about how "switch_mm()" really isn't even a
protection domain switch to begin with. We're still in the exact same
protection domain we used to be in, with the exact same direct access
to L1D$.

Why would we flush the caches on a random and irrelevant event in
kernel space? switch_mm() simply isn't relevant for caches (well,
unless you have fully virtual ones, but that's a completely different
issue).

Wouldn't it be more sensible to treat it more like TIF_NEED_FPU_LOAD -
have a per-cpu "I need to flush the cache" variable, and then the only
thing a context switch does is to see if the user changed (or
whatever) and then set the bit, and set TIF_NOTIFY_RESUME in the
thread.

Because the L1D$ flush isn't a kernel issue, it's a "don't let user
space try to attack it" issue. The kernel can already read it if it
wants to.

And that's just the "big picture" issues I see. In the big picture,
doing this when SMT is enabled is unbelievably stupid. And in the big
picture, context switch really isn't a security domain change wrt the
L1D$.

The more I look at those patches, the more I go "that's just wrong" on
some of the "small picture" implementation details.

Here's just a few cases that I reacted to

Actual low-level flushing code:

 (1) the SW fallback

     (a) is only defined on Intel, and initializing the silly cache
flush pages on any other vendor will fail.

     (b) seems to assume that 16 pages (order-4) is sufficient and
necessary. Probably "good enough", but it's also an example of "yeah,
that's expensive".

     (c) and if I read the code correctly, trying to flush the L1D$ on
non-intel without the HW support, it causes a WARN_ON_ONCE()! WTF?

 (2) the HW case is done for any vendor, if it reports the "I have the MSR"

 (3) the VMX support certainly has various sanity checks like "oh, CPU
doesn't have X86_BUG_L1TF, then I won't do this even if there was some
kernel command line to say I should". But the new prctl doesn't have
anything like that. It just enables that L1D$ thing mindlessly,
thinking that user-land software somehow knows what it's doing. BS.

 (4) what does L1D_FLUSH_POPULATE_TLB mean?

   That "option" makes zero sense. It pre-populates the TLB before
doing the accesses to the L1D$ pages in the SW case, but nothing at
all explains why that is needed. It's clearly not needed for the
caller, since the TLB population only happens for the SW fallback, not
for the HW one.

   No documentation, no nothing. It's enabled for the VMX case, not
for the non-vmx case, which makes me suspect it's some crazy "work
around vm monitor page faults, because we know our SW flush fallback
is just random garbage".

In other words, there are those big "high-level design" questions, but
also several oddities in just the implementation details.

I really get the feeling that this feature just isn't ready.

Ingo - would you mind sending me a pull request for the (independent)
TLB cleanups part of that x86/mm tree? Because everything up to and
including commit bd1de2a7aace ("x86/tlb/uv: Add a forward declaration
for struct flush_tlb_info") looks sane.

It's only this L1D$ flushing thing at the end of that branch that I
think isn't fully cooked.

               Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re:  [GIT PULL] x86/mm changes for v5.8
  2020-06-02 19:14         ` Linus Torvalds
@ 2020-06-02 23:01           ` Singh, Balbir
  2020-06-02 23:28             ` Linus Torvalds
  2020-06-04 17:29           ` [GIT PULL v2] " Ingo Molnar
  1 sibling, 1 reply; 15+ messages in thread
From: Singh, Balbir @ 2020-06-02 23:01 UTC (permalink / raw)
  To: torvalds, tglx
  Cc: mingo, linux-kernel, keescook, a.p.zijlstra, akpm, luto, bp, benh

On Tue, 2020-06-02 at 12:14 -0700, Linus Torvalds wrote:
> 
> On Tue, Jun 2, 2020 at 11:29 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > It's trivial enough to fix. We have a static key already which is
> > telling us whether SMT scheduling is active.
> 
> .. but should we do it here, in switch_mm() in the first place?
> 
> Should it perhaps just return an error if user land tries to set the
> "flush L1$" thing on an SMT system? And no, I don't think we care at
> all if people then start playing games and enabling/disabling SMT
> dynamically while applications are working. At that point the admin
> gets to keep both of the broken pieces.
> 
> Also, see my other point about how "switch_mm()" really isn't even a
> protection domain switch to begin with. We're still in the exact same
> protection domain we used to be in, with the exact same direct access
> to L1D$.
> 
> Why would we flush the caches on a random and irrelevant event in
> kernel space? switch_mm() simply isn't relevant for caches (well,
> unless you have fully virtual ones, but that's a completely different
> issue).
> 
> Wouldn't it be more sensible to treat it more like TIF_NEED_FPU_LOAD -
> have a per-cpu "I need to flush the cache" variable, and then the only
> thing a context switch does is to see if the user changed (or
> whatever) and then set the bit, and set TIF_NOTIFY_RESUME in the
> thread.
> 
> Because the L1D$ flush isn't a kernel issue, it's a "don't let user
> space try to attack it" issue. The kernel can already read it if it
> wants to.
> 
> And that's just the "big picture" issues I see. In the big picture,
> doing this when SMT is enabled is unbelievably stupid. And in the big
> picture, context switch really isn't a security domain change wrt the
> L1D$.
>

+ Kees (sorry my bad, I should have added him earlier)

I am going to take a step back and point to

https://software.intel.com/security-software-guidance/software-guidance/snoop-assisted-l1-data-sampling
https://software.intel.com/security-software-guidance/insights/deep-dive-snoop-assisted-l1-data-sampling

The suggested mitigation is

"Snoop-assisted L1D sampling could be mitigated by flushing the L1D cache
before executing potentially malicious applications"

We discussed the mitigations in an RFC

https://lore.kernel.org/lkml/20200313220415.856-1-sblbir@amazon.com/

switch_mm() was chosen as a trade-off between how long we keep the data
in the cache and how frequently we flush. What you're suggesting is that we
use switch_mm() + return to user mode to decide when the security domain
changes?

 
> The more I look at those patches, the more I go "that's just wrong" on
> some of the "small picture" implementation details.
> 
> Here's just a few cases that I reacted to
> 
> Actual low-level flushing code:
> 
>  (1) the SW fallback
> 
>      (a) is only defined on Intel, and initializing the silly cache
> flush pages on any other vendor will fail.
> 
>      (b) seems to assume that 16 pages (order-4) is sufficient and
> necessary. Probably "good enough", but it's also an example of "yeah,
> that's expensive".
> 

The software flush is largely code reuse, assuming the L1TF bits
were right.

>      (c) and if I read the code correctly, trying to flush the L1D$ on
> non-intel without the HW support, it causes a WARN_ON_ONCE()! WTF?

That is not correct: the function only complains if we do a software fallback
flush without allocating the flush pages. That function is not exposed unless
the user opts in via the prctl() API, which allocates those flush pages. The
other user of this API is the L1TF flush logic.

> 
>  (2) the HW case is done for any vendor, if it reports the "I have the MSR"

No, l1d_flush_init_once() fails for users opting in via the prctl(); it
succeeds for users of L1TF.


> 
>  (3) the VMX support certainly has various sanity checks like "oh, CPU
> doesn't have X86_BUG_L1TF, then I won't do this even if there was some
> kernel command line to say I should". But the new prctl doesn't have
> anything like that. It just enables that L1D$ thing mindlessly,
> thinking that user-land software somehow knows what it's doing. BS.

So you'd like to see a double opt-in? Unfortunately there is no gating
of the bug, and I tried to make it generic - clearly calling it opt-in
flushing for the paranoid, for those who really care about CVE-2020-0550.

> 
>  (4) what does L1D_FLUSH_POPULATE_TLB mean?
> 
>    That "option" makes zero sense. It pre-populates the TLB before
> doing the accesses to the L1D$ pages in the SW case, but nothing at
> all explains why that is needed. It's clearly not needed for the
> caller, since the TLB population only happens for the SW fallback, not
> for the HW one.

Good question. I asked around in the RFC and in my email threads as
to why we needed this even for the L1TF case, with no response. I decided
not to do it unless I fully understood why it's needed.

I am happy to remove the SW fallback if needed.

> 
>    No documentation, no nothing. It's enabled for the VMX case, not
> for the non-vmx case, which makes me suspect it's some crazy "work
> around vm monitor page faults, because we know our SW flush fallback
> is just random garbage".
>

Would this make you happier?

1. Remove SW fallback flush
2. Implement a double opt-in (CAP_SYS_ADMIN for the prctl or a
   system wide disable)?
3. Ensure the flush happens only when the current core has
   SMT disabled

W.r.t. switch_mm() vs another place to flush, it is a trade-off;
would 1 to 3 convince you?

In summary, there was a discussion, two RFCs, and the patches were reviewed.
You'd like to see further filtering of the flush, which is covered in points
1 to 3 above.

Balbir Singh.
 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] x86/mm changes for v5.8
  2020-06-02 23:01           ` Singh, Balbir
@ 2020-06-02 23:28             ` Linus Torvalds
  2020-06-03  1:31               ` Singh, Balbir
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2020-06-02 23:28 UTC (permalink / raw)
  To: Singh, Balbir
  Cc: tglx, mingo, linux-kernel, keescook, a.p.zijlstra, akpm, luto, bp, benh

On Tue, Jun 2, 2020 at 4:01 PM Singh, Balbir <sblbir@amazon.com> wrote:
>
> >      (c) and if I read the code correctly, trying to flush the L1D$ on
> > non-intel without the HW support, it causes a WARN_ON_ONCE()! WTF?
>
> That is not correct, the function only complains if we do a software fallback
> flush without allocating the flush pages.

Right.

And if you're not on Intel, then that allocation would never have been
done, since the allocation function returns an error for non-intel
systems.

> That function is not exposed without
> the user using the prctl() API, which allocates those flush pages.

See above: it doesn't actually allocate those pages on anything but intel CPU's.

That said, looking deeper, it then does look like a
l1d_flush_init_once() failure will also cause the code to avoid
setting the TIF_SPEC_L1D_FLUSH bit, so non-intel CPU's will never call
the actual flushing routines, and thus never hit the WARN_ON. Ok.

> >  (2) the HW case is done for any vendor, if it reports the "I have the MSR"
>
> No, l1d_flush_init_once() fails for users opting in via the prctl(); it
> succeeds for users of L1TF.

Yeah, again it looks like this all is basically just a hack for Intel CPU's.

It should never have been conditional on "do this on Intel".

It should have been conditional on the L1TF bug.

Yes, there's certainly overlap there, but it's not complete.

> >  (3) the VMX support certainly has various sanity checks like "oh, CPU
> > doesn't have X86_BUG_L1TF, then I won't do this even if there was some
> > kernel command line to say I should". But the new prctl doesn't have
> > anything like that. It just enables that L1D$ thing mindlessly,
> > thinking that user-land software somehow knows what it's doing. BS.
>
> So you'd like to see a double opt-in?

I'd like it to be gated on being sane by default, together with some
system option like we have for pretty much all the mitigations.

>     Unfortunately there is no gating
> of the bug, and I tried to make it generic - clearly calling it opt-in
> flushing for the paranoid, for those who really care about CVE-2020-0550.

No, you didn't make it generic at all - you made it depend on
X86_VENDOR_INTEL instead.

So now the logic is "on Intel, do this thing whether it makes sense or
not, on other vendors, never do it whether it _would_ make sense or
not".

That to me is not sensible. I just don't see the logic.

This feature should never be enabled unless X86_BUG_L1TF is on, as far
as I can tell.

And it should never be enabled if SMT is on.

At that point, it at least starts making sense. Maybe we don't need
any further admin options at that point.

> Would this make you happier?
>
> 1. Remove SW fallback flush
> 2. Implement a double opt-in (CAP_SYS_ADMIN for the prctl or a
>    system wide disable)?
> 3. Ensure the flush happens only when the current core has
>    SMT disabled

I think that (3) case should basically be "X86_BUG_L1TF && !SMT". That
should basically be the default setting for this.

The (2) thing I would prefer to just be the same kind of thing we do
for all the other mitigations: have a kernel command line to override
the defaults.

The SW fallback right now feels wrong to me. It does seem to be very
microarchitecture-specific and I'd really like to understand the
reason for the magic TLB filling. At the same time, if the feature is
at least enabled under sane and understandable circumstances, and
people have a way to turn it off, maybe I don't care too much.

                  Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL] x86/mm changes for v5.8
  2020-06-02 23:28             ` Linus Torvalds
@ 2020-06-03  1:31               ` Singh, Balbir
  0 siblings, 0 replies; 15+ messages in thread
From: Singh, Balbir @ 2020-06-03  1:31 UTC (permalink / raw)
  To: torvalds
  Cc: mingo, linux-kernel, keescook, a.p.zijlstra, tglx, akpm, luto, bp, benh

On Tue, 2020-06-02 at 16:28 -0700, Linus Torvalds wrote:
> On Tue, Jun 2, 2020 at 4:01 PM Singh, Balbir <sblbir@amazon.com> wrote:
> > 
> > >      (c) and if I read the code correctly, trying to flush the L1D$ on
> > > non-intel without the HW support, it causes a WARN_ON_ONCE()! WTF?
> > 
> > That is not correct, the function only complains if we do a software fallback
> > flush without allocating the flush pages.
> 
> Right.
> 
> And if you're not on Intel, then that allocation would never have been
> done, since the allocation function returns an error for non-intel
> systems.
> 
> > That function is not exposed without
> > the user using the prctl() API, which allocates those flush pages.
> 
> See above: it doesn't actually allocate those pages on anything but intel CPU's.
> 
> That said, looking deeper, it then does look like a
> l1d_flush_init_once() failure will also cause the code to avoid
> setting the TIF_SPEC_L1D_FLUSH bit, so non-intel CPU's will never call
> the actual flushing routines, and thus never hit the WARN_ON. Ok.
> 
> > >  (2) the HW case is done for any vendor, if it reports the "I have the MSR"
> > 
> > No, l1d_flush_init_once() fails for users opting in via the prctl(); it
> > succeeds for users of L1TF.
> 
> Yeah, again it looks like this all is basically just a hack for Intel CPU's.
> 
> It should never have been conditional on "do this on Intel".
> 
> It should have been conditional on the L1TF bug.
> 
> Yes, there's certainly overlap there, but it's not complete.
> 
> > >  (3) the VMX support certainly has various sanity checks like "oh, CPU
> > > doesn't have X86_BUG_L1TF, then I won't do this even if there was some
> > > kernel command line to say I should". But the new prctl doesn't have
> > > anything like that. It just enables that L1D$ thing mindlessly,
> > > thinking that user-land software somehow knows what it's doing. BS.
> > 
> > So you'd like to see a double opt-in?
> 
> I'd like it to be gated on being sane by default, together with some
> system option like we have for pretty much all the mitigations.
> 
> >     Unfortunately there is no gating
> > of the bug, and I tried to make it generic - clearly calling it opt-in
> > flushing for the paranoid, for those who really care about CVE-2020-0550.
> 
> No, you didn't make it generic at all - you made it depend on
> X86_VENDOR_INTEL instead.
> 
> So now the logic is "on Intel, do this thing whether it makes sense or
> not, on other vendors, never do it whether it _would_ make sense or
> not".
> 
> That to me is not sensible. I just don't see the logic.
> 
> This feature should never be enabled unless X86_BUG_L1TF is on, as far
> as I can tell.
> 
> And it should never be enabled if SMT is on.
> 
> At that point, it at least starts making sense. Maybe we don't need
> any further admin options at that point.
> 
> > Would this make you happier?
> > 
> > 1. Remove SW fallback flush
> > 2. Implement a double opt-in (CAP_SYS_ADMIN for the prctl or a
> >    system wide disable)?
> > 3. Ensure the flush happens only when the current core has
> >    SMT disabled
> 
> I think that (3) case should basically be "X86_BUG_L1TF && !SMT". That
> should basically be the default setting for this.
> 
> The (2) thing I would prefer to just be the same kind of thing we do
> for all the other mitigations: have a kernel command line to override
> the defaults.
> 
> The SW fallback right now feels wrong to me. It does seem to be very
> microarchitecture-specific and I'd really like to understand the
> reason for the magic TLB filling. At the same time, if the feature is
> at least enabled under sane and understandable circumstances, and
> people have a way to turn it off, maybe I don't care too much.
>

I cooked up a quick patch (as yet untested, leaving the current refactoring
as is) for comments. This should hopefully address your concerns.
This is not the final patch, just the approach for the line of thinking
so far.


diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index a58360c8e6e8..988a9d0c31ec 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -293,6 +293,13 @@ enum taa_mitigations {
 	TAA_MITIGATION_TSX_DISABLED,
 };
 
+enum l1d_flush_out_mitigations {
+	L1D_FLUSH_OUT_OFF,
+	L1D_FLUSH_OUT_ON,
+};
+
+static enum l1d_flush_out_mitigations l1d_flush_out_mitigation __ro_after_init = L1D_FLUSH_OUT_ON;
+
 /* Default mitigation for TAA-affected CPUs */
 static enum taa_mitigations taa_mitigation __ro_after_init = TAA_MITIGATION_VERW;
 static bool taa_nosmt __ro_after_init;
@@ -376,6 +383,18 @@ static void __init taa_select_mitigation(void)
 	pr_info("%s\n", taa_strings[taa_mitigation]);
 }
 
+static int __init l1d_flush_out_parse_cmdline(char *str)
+{
+	if (!boot_cpu_has_bug(X86_BUG_L1TF))
+		return 0;
+
+	if (!strcmp(str, "off"))
+		l1d_flush_out_mitigation = L1D_FLUSH_OUT_OFF;
+
+	return 0;
+}
+early_param("l1d_flush_out", l1d_flush_out_parse_cmdline);
+
 static int __init tsx_async_abort_parse_cmdline(char *str)
 {
 	if (!boot_cpu_has_bug(X86_BUG_TAA))
@@ -1123,6 +1142,10 @@ static void task_update_spec_tif(struct task_struct *tsk)
 
 static int l1d_flush_out_prctl_set(struct task_struct *task, unsigned long ctrl)
 {
+
+	if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
+		return -EPERM;
+
 	switch (ctrl) {
 	case PR_SPEC_ENABLE:
 		return enable_l1d_flush_for_task(task);
@@ -1240,6 +1263,9 @@ static int l1d_flush_out_prctl_get(struct task_struct *task)
 {
 	int ret;
 
+	if (l1d_flush_out_mitigation == L1D_FLUSH_OUT_OFF)
+		return PR_SPEC_FORCE_DISABLE;
+
 	ret = test_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);
 	if (ret)
 		return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
diff --git a/arch/x86/kernel/l1d_flush.c b/arch/x86/kernel/l1d_flush.c
index 4662f90fa321..0b898c1b76cd 100644
--- a/arch/x86/kernel/l1d_flush.c
+++ b/arch/x86/kernel/l1d_flush.c
@@ -89,9 +89,6 @@ int l1d_flush_init_once(void)
 {
 	int ret = 0;
 
-	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
-		return -ENOTSUPP;
-
 	if (static_cpu_has(X86_FEATURE_FLUSH_L1D) || l1d_flush_pages)
 		return ret;
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ff1ff8c83452..ddab8166a474 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -320,12 +320,29 @@ EXPORT_SYMBOL_GPL(leave_mm);
 
 int enable_l1d_flush_for_task(struct task_struct *tsk)
 {
-	int ret = l1d_flush_init_once();
+	int cpu = get_cpu();
+	int ret;
+
+	/*
+	 * Do not enable L1D_FLUSH_OUT if
+	 * a. The current core has SMT enabled
+	 * b. The CPU is not affected by the L1TF bug
+	 * c. The CPU does not have L1D FLUSH feature support
+	 */
+	if ((cpumask_weight(topology_sibling_cpumask(cpu)) != 1) ||
+			!boot_cpu_has_bug(X86_BUG_L1TF) ||
+			!static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+		ret = -EINVAL;
+		goto done;
+	}
 
+	ret = l1d_flush_init_once();
 	if (ret < 0)
-		return ret;
+		goto done;
 
 	set_ti_thread_flag(&tsk->thread_info, TIF_SPEC_L1D_FLUSH);
+done:
+	put_cpu();
 	return ret;
 }
 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [GIT PULL v2] x86/mm changes for v5.8
  2020-06-02 19:14         ` Linus Torvalds
  2020-06-02 23:01           ` Singh, Balbir
@ 2020-06-04 17:29           ` Ingo Molnar
  2020-06-05  2:41             ` Linus Torvalds
  1 sibling, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2020-06-04 17:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Benjamin Herrenschmidt, Balbir Singh,
	Peter Zijlstra, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, Andrew Lutomirski


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I really get the feeling that this feature just isn't ready.
> 
> Ingo - would you mind sending me a pull request for the (independent)
> TLB cleanups part of that x86/mm tree? Because everything up to and
> including commit bd1de2a7aace ("x86/tlb/uv: Add a forward declaration
> for struct flush_tlb_info") looks sane.

Yeah, agreed, in fact we moved x86/mm back to bd1de2a7aace almost 
immediately and put the L1D bits into WIP.x86/mm. Fortunately they are 
on top of the rest, as you noted, so it didn't have to be rebased.

> It's only this L1D$ flushing thing at the end of that branch that I 
> think isn't fully cooked.

Yeah, sure - here's the updated pull request for the rest:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-mm-2020-06-04

   # HEAD: bd1de2a7aace4d1d312fb1be264b8fafdb706208 x86/tlb/uv: Add a forward declaration for struct flush_tlb_info

Misc changes:

 - Unexport various PAT primitives

 - Unexport per-CPU tlbstate

 Thanks,

	Ingo

------------------>
Borislav Petkov (1):
      x86/tlb/uv: Add a forward declaration for struct flush_tlb_info

Christoph Hellwig (5):
      x86/mm: Add a x86_has_pat_wp() helper
      x86/mm: Move pgprot2cachemode out of line
      x86/mm: Cleanup pgprot_4k_2_large() and pgprot_large_2_4k()
      x86/mm: Unexport __cachemode2pte_tbl
      x86/mm: Use pgprotval_t in protval_4k_2_large() and protval_large_2_4k()

Thomas Gleixner (17):
      x86/tlb: Uninline __get_current_cr3_fast()
      x86/cpu: Uninline CR4 accessors
      x86/cr4: Sanitize CR4.PCE update
      x86/alternatives: Move temporary_mm helpers into C
      x86/tlb: Move __flush_tlb() out of line
      x86/tlb: Move __flush_tlb_global() out of line
      x86/tlb: Move __flush_tlb_one_user() out of line
      x86/tlb: Move __flush_tlb_one_kernel() out of line
      x86/tlb: Move flush_tlb_others() out of line
      x86/tlb: Move __flush_tlb_all() out of line
      x86/tlb: Move paravirt_tlb_remove_table() to the usage site
      x86/tlb: Move cr4_set_bits_and_update_boot() to the usage site
      x86/tlb: Uninline nmi_uaccess_okay()
      x86/tlb: Move PCID helpers where they are used
      xen/privcmd: Remove unneeded asm/tlb.h include
      x86/tlb: Restrict access to tlbstate
      x86/cpu: Export native_write_cr4() only when CONFIG_LKTDM=m


 arch/x86/events/core.c               |  11 +-
 arch/x86/include/asm/memtype.h       |   3 +
 arch/x86/include/asm/mmu_context.h   |  88 +------
 arch/x86/include/asm/paravirt.h      |  12 +-
 arch/x86/include/asm/pgtable_32.h    |   2 +-
 arch/x86/include/asm/pgtable_types.h |  44 +---
 arch/x86/include/asm/tlbflush.h      | 441 ++++-------------------------------
 arch/x86/include/asm/uv/uv.h         |   1 +
 arch/x86/kernel/alternative.c        |  55 +++++
 arch/x86/kernel/cpu/common.c         |  25 +-
 arch/x86/kernel/cpu/mtrr/generic.c   |   4 +-
 arch/x86/kernel/paravirt.c           |  21 +-
 arch/x86/kernel/process.c            |  11 +
 arch/x86/mm/init.c                   |  44 +++-
 arch/x86/mm/init_64.c                |   4 +-
 arch/x86/mm/ioremap.c                |  10 +-
 arch/x86/mm/kmmio.c                  |   2 +-
 arch/x86/mm/mem_encrypt.c            |   2 +-
 arch/x86/mm/pat/set_memory.c         |   7 +-
 arch/x86/mm/pgtable.c                |  16 +-
 arch/x86/mm/pgtable_32.c             |   2 +-
 arch/x86/mm/tlb.c                    | 384 +++++++++++++++++++++++++++++-
 arch/x86/platform/uv/tlb_uv.c        |   4 +-
 drivers/xen/privcmd.c                |   1 -
 24 files changed, 608 insertions(+), 586 deletions(-)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [GIT PULL v2] x86/mm changes for v5.8
  2020-06-04 17:29           ` [GIT PULL v2] " Ingo Molnar
@ 2020-06-05  2:41             ` Linus Torvalds
  2020-06-05  8:11               ` [GIT PULL v3] " Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2020-06-05  2:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Benjamin Herrenschmidt, Balbir Singh,
	Peter Zijlstra, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, Andrew Lutomirski

On Thu, Jun 4, 2020 at 10:29 AM Ingo Molnar <mingo@kernel.org> wrote:
>
> Yeah, sure - here's the updated pull request for the rest:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-mm-2020-06-04
>
>    # HEAD: bd1de2a7aace4d1d312fb1be264b8fafdb706208 x86/tlb/uv: Add a forward declaration for struct flush_tlb_info

Nope, that still points to

 0fcfdf55db9e1ecf85edd6aa8d0bc78a448cb96a Documentation: Add L1D
flushing Documentation

although it looks like the 'x86/mm' _branch_ does point to that commit
bd1de2a7aace.

You did something odd where you created a new tag, but used the old branch. Hmm?

                Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [GIT PULL v3] x86/mm changes for v5.8
  2020-06-05  2:41             ` Linus Torvalds
@ 2020-06-05  8:11               ` Ingo Molnar
  2020-06-05 20:40                 ` pr-tracker-bot
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2020-06-05  8:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Benjamin Herrenschmidt, Balbir Singh,
	Peter Zijlstra, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, Andrew Lutomirski


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Jun 4, 2020 at 10:29 AM Ingo Molnar <mingo@kernel.org> wrote:
> >
> > Yeah, sure - here's the updated pull request for the rest:
> >
> >    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-mm-2020-06-04
> >
> >    # HEAD: bd1de2a7aace4d1d312fb1be264b8fafdb706208 x86/tlb/uv: Add a forward declaration for struct flush_tlb_info
> 
> Nope, that still points to
> 
>  0fcfdf55db9e1ecf85edd6aa8d0bc78a448cb96a Documentation: Add L1D
> flushing Documentation
> 
> although it looks like the 'x86/mm' _branch_ does point to that commit
> bd1de2a7aace.
> 
> You did something odd where you created a new tag, but used the old branch. Hmm?

Oops, my apologies, indeed I did something odd with tags: my testing 
script used a x86-mm-for-linus temporary branch to construct the tag 
and test the tree, and merged x86/mm into it - but this branch was 
still a leftover from the previous pull request and the merge of 
x86/mm into it succeeded without an error ...

In a comedy of errors, while it used the out of date x86-mm-for-linus 
to generate the tag, it consistently used the correct x86/mm branch 
for the pull request itself, and thus my fail-safes to catch such 
mishaps of sending the wrong tree to you all failed. :-/

I fixed my tag workflow, re-tested this case (and a few others) to 
make sure it doesn't happen again, and find below the updated v3 pull 
request. Sorry about this!

---
Linus,

Please pull the latest x86/mm git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-mm-2020-06-05

   # HEAD: bd1de2a7aace4d1d312fb1be264b8fafdb706208 x86/tlb/uv: Add a forward declaration for struct flush_tlb_info

Misc changes:

 - Unexport various PAT primitives

 - Unexport per-CPU tlbstate

 Thanks,

	Ingo

------------------>
Borislav Petkov (1):
      x86/tlb/uv: Add a forward declaration for struct flush_tlb_info

Christoph Hellwig (5):
      x86/mm: Add a x86_has_pat_wp() helper
      x86/mm: Move pgprot2cachemode out of line
      x86/mm: Cleanup pgprot_4k_2_large() and pgprot_large_2_4k()
      x86/mm: Unexport __cachemode2pte_tbl
      x86/mm: Use pgprotval_t in protval_4k_2_large() and protval_large_2_4k()

Thomas Gleixner (17):
      x86/tlb: Uninline __get_current_cr3_fast()
      x86/cpu: Uninline CR4 accessors
      x86/cr4: Sanitize CR4.PCE update
      x86/alternatives: Move temporary_mm helpers into C
      x86/tlb: Move __flush_tlb() out of line
      x86/tlb: Move __flush_tlb_global() out of line
      x86/tlb: Move __flush_tlb_one_user() out of line
      x86/tlb: Move __flush_tlb_one_kernel() out of line
      x86/tlb: Move flush_tlb_others() out of line
      x86/tlb: Move __flush_tlb_all() out of line
      x86/tlb: Move paravirt_tlb_remove_table() to the usage site
      x86/tlb: Move cr4_set_bits_and_update_boot() to the usage site
      x86/tlb: Uninline nmi_uaccess_okay()
      x86/tlb: Move PCID helpers where they are used
      xen/privcmd: Remove unneeded asm/tlb.h include
      x86/tlb: Restrict access to tlbstate
      x86/cpu: Export native_write_cr4() only when CONFIG_LKTDM=m


 arch/x86/events/core.c               |  11 +-
 arch/x86/include/asm/memtype.h       |   3 +
 arch/x86/include/asm/mmu_context.h   |  88 +------
 arch/x86/include/asm/paravirt.h      |  12 +-
 arch/x86/include/asm/pgtable_32.h    |   2 +-
 arch/x86/include/asm/pgtable_types.h |  44 +---
 arch/x86/include/asm/tlbflush.h      | 441 ++++-------------------------------
 arch/x86/include/asm/uv/uv.h         |   1 +
 arch/x86/kernel/alternative.c        |  55 +++++
 arch/x86/kernel/cpu/common.c         |  25 +-
 arch/x86/kernel/cpu/mtrr/generic.c   |   4 +-
 arch/x86/kernel/paravirt.c           |  21 +-
 arch/x86/kernel/process.c            |  11 +
 arch/x86/mm/init.c                   |  44 +++-
 arch/x86/mm/init_64.c                |   4 +-
 arch/x86/mm/ioremap.c                |  10 +-
 arch/x86/mm/kmmio.c                  |   2 +-
 arch/x86/mm/mem_encrypt.c            |   2 +-
 arch/x86/mm/pat/set_memory.c         |   7 +-
 arch/x86/mm/pgtable.c                |  16 +-
 arch/x86/mm/pgtable_32.c             |   2 +-
 arch/x86/mm/tlb.c                    | 384 +++++++++++++++++++++++++++++-
 arch/x86/platform/uv/tlb_uv.c        |   4 +-
 drivers/xen/privcmd.c                |   1 -
 24 files changed, 608 insertions(+), 586 deletions(-)


* Re: [GIT PULL v3] x86/mm changes for v5.8
  2020-06-05  8:11               ` [GIT PULL v3] " Ingo Molnar
@ 2020-06-05 20:40                 ` pr-tracker-bot
  0 siblings, 0 replies; 15+ messages in thread
From: pr-tracker-bot @ 2020-06-05 20:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Thomas Gleixner, Benjamin Herrenschmidt,
	Balbir Singh, Peter Zijlstra, Andrew Morton, Borislav Petkov,
	Linux Kernel Mailing List, Andrew Lutomirski

The pull request you sent on Fri, 5 Jun 2020 10:11:37 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-mm-2020-06-05

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/f4dd60a3d4c7656dcaa0ba2afb503528c86f913f

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker



Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
2020-06-01 17:01 [GIT PULL] x86/mm changes for v5.8 Ingo Molnar
2020-06-01 21:42 ` Linus Torvalds
2020-06-02  2:35   ` Linus Torvalds
2020-06-02 10:25     ` Singh, Balbir
2020-06-02  7:33   ` Ingo Molnar
2020-06-02  9:37     ` Benjamin Herrenschmidt
2020-06-02 18:28       ` Thomas Gleixner
2020-06-02 19:14         ` Linus Torvalds
2020-06-02 23:01           ` Singh, Balbir
2020-06-02 23:28             ` Linus Torvalds
2020-06-03  1:31               ` Singh, Balbir
2020-06-04 17:29           ` [GIT PULL v2] " Ingo Molnar
2020-06-05  2:41             ` Linus Torvalds
2020-06-05  8:11               ` [GIT PULL v3] " Ingo Molnar
2020-06-05 20:40                 ` pr-tracker-bot
