* [PATCH v3 0/6] Delay VERW
@ 2023-10-25 20:52 Pawan Gupta
  2023-10-25 20:52 ` [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW Pawan Gupta
                   ` (5 more replies)
  0 siblings, 6 replies; 32+ messages in thread
From: Pawan Gupta @ 2023-10-25 20:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen
  Cc: linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Pawan Gupta, Alyssa Milburn,
	Andrew Cooper, Dave Hansen, Nikolay Borisov

v3:
- Use .entry.text section for VERW memory operand. (Andrew/PeterZ)
- Fix the duplicate header inclusion. (Chao)

v2: https://lore.kernel.org/r/20231024-delay-verw-v2-0-f1881340c807@linux.intel.com
- Removed the extra EXEC_VERW macro layers. (Sean)
- Move NOPL before VERW. (Sean)
- s/USER_CLEAR_CPU_BUFFERS/CLEAR_CPU_BUFFERS/. (Josh/Dave)
- Removed the comments before CLEAR_CPU_BUFFERS. (Josh)
- Remove CLEAR_CPU_BUFFERS from NMI returning to kernel and document the
  reason. (Josh/Dave)
- Reformat comment in md_clear_update_mitigation(). (Josh)
- Squash "x86/bugs: Cleanup mds_user_clear" patch. (Nikolay)
- s/GUEST_CLEAR_CPU_BUFFERS/CLEAR_CPU_BUFFERS/. (Josh)
- Added a patch from Sean to use CFLAGS.CF for VMLAUNCH/VMRESUME
  selection. This facilitates a single CLEAR_CPU_BUFFERS location for both
  VMLAUNCH and VMRESUME. (Sean)

v1: https://lore.kernel.org/r/20231020-delay-verw-v1-0-cff54096326d@linux.intel.com

Hi,

The legacy VERW instruction was overloaded by some processors to clear
micro-architectural CPU buffers as a mitigation for CPU bugs. This
series moves VERW execution to a later point in the exit-to-user path.
This is needed because in some cases it may be possible for kernel data
to be accessed after the VERW in arch_exit_to_user_mode(). Such accesses
may put data into MDS-affected CPU buffers, for example:

  1. Kernel data accessed by an NMI between VERW and return-to-user can
     remain in CPU buffers (since NMI returning to kernel does not
     execute VERW to clear CPU buffers).
  2. Alyssa reported that after VERW is executed,
     CONFIG_GCC_PLUGIN_STACKLEAK=y scrubs the stack used by a system
     call. Memory accesses during stack scrubbing can move kernel stack
     contents into CPU buffers.
  3. When caller-saved registers are restored after a return from the
     function executing VERW, the kernel stack accesses can remain in
     CPU buffers (since they occur after VERW).

Although these cases are harder to exploit in practice, moving VERW
closer to the ring transition reduces the attack surface.
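
For reference, the buffer clearing itself is just the memory-operand
form of VERW pointing at a valid selector. Roughly, the existing C
helper looks like this (a sketch of mds_clear_cpu_buffers(), whose
exit-to-user call site this series removes):

  static __always_inline void mds_clear_cpu_buffers(void)
  {
          static const u16 ds = __KERNEL_DS;

          /*
           * Only the memory-operand variant of VERW triggers the
           * microcode-assisted CPU buffer clearing.
           */
          asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
  }

The helper itself stays for other callers (e.g. idle and the MMIO Stale
Data path); the exit-to-user clearing moves to an asm macro
(CLEAR_CPU_BUFFERS) placed immediately before the ring transition, so
that no kernel memory is touched between VERW and SYSRET/IRET.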

Overview of the series:

Patch 1: Prepares VERW macros for use in asm.
Patch 2: Adds macros to 64-bit entry/exit points.
Patch 3: Adds macros to 32-bit entry/exit points.
Patch 4: Enables the new macros.
Patch 5: Uses CFLAGS.CF for VMLAUNCH/VMRESUME selection.
Patch 6: Adds macro to VMenter.

Below is some performance data collected with v1 on a Skylake client,
compared with the previous implementation:

Baseline: v6.6-rc5

| Test               | Configuration          | Relative |
| ------------------ | ---------------------- | -------- |
| build-linux-kernel | defconfig              | 1.00     |
| hackbench          | 32 - Process           | 1.02     |
| nginx              | Short Connection - 500 | 1.01     |

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
Pawan Gupta (5):
      x86/bugs: Add asm helpers for executing VERW
      x86/entry_64: Add VERW just before userspace transition
      x86/entry_32: Add VERW just before userspace transition
      x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
      KVM: VMX: Move VERW closer to VMentry for MDS mitigation

Sean Christopherson (1):
      KVM: VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH

 Documentation/arch/x86/mds.rst       | 39 ++++++++++++++++++++++++++----------
 arch/x86/entry/entry.S               | 16 +++++++++++++++
 arch/x86/entry/entry_32.S            |  3 +++
 arch/x86/entry/entry_64.S            | 11 ++++++++++
 arch/x86/entry/entry_64_compat.S     |  1 +
 arch/x86/include/asm/cpufeatures.h   |  2 +-
 arch/x86/include/asm/entry-common.h  |  1 -
 arch/x86/include/asm/nospec-branch.h | 27 ++++++++++++++-----------
 arch/x86/kernel/cpu/bugs.c           | 15 ++++++--------
 arch/x86/kernel/nmi.c                |  2 --
 arch/x86/kvm/vmx/run_flags.h         |  7 +++++--
 arch/x86/kvm/vmx/vmenter.S           |  9 ++++++---
 arch/x86/kvm/vmx/vmx.c               | 10 ++++++---
 13 files changed, 99 insertions(+), 44 deletions(-)
---
base-commit: 05d3ef8bba77c1b5f98d941d8b2d4aeab8118ef1
change-id: 20231011-delay-verw-d0474986b2c3

Best regards,
-- 
Thanks,
Pawan




* [PATCH  v3 1/6] x86/bugs: Add asm helpers for executing VERW
  2023-10-25 20:52 [PATCH v3 0/6] Delay VERW Pawan Gupta
@ 2023-10-25 20:52 ` Pawan Gupta
  2023-10-25 21:10   ` Andrew Cooper
  2023-10-26 13:44   ` Nikolay Borisov
  2023-10-25 20:52 ` [PATCH v3 2/6] x86/entry_64: Add VERW just before userspace transition Pawan Gupta
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 32+ messages in thread
From: Pawan Gupta @ 2023-10-25 20:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen
  Cc: linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Pawan Gupta, Alyssa Milburn,
	Andrew Cooper

The MDS mitigation requires clearing the CPU buffers before returning
to user. This needs to be done late in the exit-to-user path. The
current location of VERW leaves a possibility of kernel data ending up
in CPU buffers for memory accesses done after VERW, such as:

  1. Kernel data accessed by an NMI between VERW and return-to-user can
     remain in CPU buffers (since an NMI returning to kernel does not
     execute VERW to clear CPU buffers).
  2. Alyssa reported that after VERW is executed,
     CONFIG_GCC_PLUGIN_STACKLEAK=y scrubs the stack used by a system
     call. Memory accesses during stack scrubbing can move kernel stack
     contents into CPU buffers.
  3. When caller-saved registers are restored after a return from the
     function executing VERW, the kernel stack accesses can remain in
     CPU buffers (since they occur after VERW).

To fix this, VERW needs to be moved very late in the exit-to-user path.

In preparation for moving VERW to entry/exit asm code, create macros
that can be used in asm. Also make them depend on a new feature flag
X86_FEATURE_CLEAR_CPU_BUF.

Reported-by: Alyssa Milburn <alyssa.milburn@intel.com>
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/entry/entry.S               | 16 ++++++++++++++++
 arch/x86/include/asm/cpufeatures.h   |  2 +-
 arch/x86/include/asm/nospec-branch.h | 15 +++++++++++++++
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry.S b/arch/x86/entry/entry.S
index bfb7bcb362bc..f8ba0c0b6e60 100644
--- a/arch/x86/entry/entry.S
+++ b/arch/x86/entry/entry.S
@@ -6,6 +6,9 @@
 #include <linux/linkage.h>
 #include <asm/export.h>
 #include <asm/msr-index.h>
+#include <asm/unwind_hints.h>
+#include <asm/segment.h>
+#include <asm/cache.h>
 
 .pushsection .noinstr.text, "ax"
 
@@ -20,3 +23,16 @@ SYM_FUNC_END(entry_ibpb)
 EXPORT_SYMBOL_GPL(entry_ibpb);
 
 .popsection
+
+.pushsection .entry.text, "ax"
+
+.align L1_CACHE_BYTES, 0xcc
+SYM_CODE_START_NOALIGN(mds_verw_sel)
+	UNWIND_HINT_UNDEFINED
+	ANNOTATE_NOENDBR
+	.word __KERNEL_DS
+SYM_CODE_END(mds_verw_sel);
+/* For KVM */
+EXPORT_SYMBOL_GPL(mds_verw_sel);
+
+.popsection
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 58cb9495e40f..f21fc0f12737 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -308,10 +308,10 @@
 #define X86_FEATURE_SMBA		(11*32+21) /* "" Slow Memory Bandwidth Allocation */
 #define X86_FEATURE_BMEC		(11*32+22) /* "" Bandwidth Monitoring Event Configuration */
 #define X86_FEATURE_USER_SHSTK		(11*32+23) /* Shadow stack support for user mode applications */
-
 #define X86_FEATURE_SRSO		(11*32+24) /* "" AMD BTB untrain RETs */
 #define X86_FEATURE_SRSO_ALIAS		(11*32+25) /* "" AMD BTB untrain RETs through aliasing */
 #define X86_FEATURE_IBPB_ON_VMEXIT	(11*32+26) /* "" Issue an IBPB only on VMEXIT */
+#define X86_FEATURE_CLEAR_CPU_BUF	(11*32+27) /* "" Clear CPU buffers */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index c55cc243592e..005e69f93115 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -329,6 +329,21 @@
 #endif
 .endm
 
+/*
+ * Macros to execute VERW instruction that mitigate transient data sampling
+ * attacks such as MDS. On affected systems a microcode update overloaded VERW
+ * instruction to also clear the CPU buffers. VERW clobbers CFLAGS.ZF.
+ *
+ * Note: Only the memory operand variant of VERW clears the CPU buffers.
+ */
+.macro EXEC_VERW
+	verw _ASM_RIP(mds_verw_sel)
+.endm
+
+.macro CLEAR_CPU_BUFFERS
+	ALTERNATIVE "", __stringify(EXEC_VERW), X86_FEATURE_CLEAR_CPU_BUF
+.endm
+
 #else /* __ASSEMBLY__ */
 
 #define ANNOTATE_RETPOLINE_SAFE					\

-- 
2.34.1




* [PATCH  v3 2/6] x86/entry_64: Add VERW just before userspace transition
  2023-10-25 20:52 [PATCH v3 0/6] Delay VERW Pawan Gupta
  2023-10-25 20:52 ` [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW Pawan Gupta
@ 2023-10-25 20:52 ` Pawan Gupta
  2023-10-26 16:25   ` Nikolay Borisov
  2023-10-25 20:53 ` [PATCH v3 3/6] x86/entry_32: " Pawan Gupta
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Pawan Gupta @ 2023-10-25 20:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen
  Cc: linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Pawan Gupta, Dave Hansen

The mitigation for MDS is to use the VERW instruction to clear any
secrets in CPU buffers. Data from memory accesses done after VERW
executes can still remain in CPU buffers. It is safer to execute VERW
late in the return-to-user path to minimize the window in which kernel
data can end up in CPU buffers. There are not many kernel secrets to be
had after SWITCH_TO_USER_CR3.

Add support for deploying the VERW mitigation after user register state
is restored. This helps minimize the chances of kernel data ending up
in CPU buffers after executing VERW.

Note that the mitigation at the new location is not yet enabled.

  Corner case not handled
  =======================
  Interrupts returning to kernel don't clear CPU buffers since the
  exit-to-user path is expected to do that anyway. But there could be a
  case when an NMI is generated in the kernel after the exit-to-user
  path has cleared the buffers. This case is not handled, and NMIs
  returning to kernel don't clear CPU buffers, because:

  1. It is rare to get an NMI after VERW, but before returning to userspace.
  2. For an unprivileged user, there is no known way to make that NMI
     less rare or target it.
  3. It would take a large number of these precisely-timed NMIs to mount
     an actual attack.  There's presumably not enough bandwidth.
  4. The NMI in question occurs after a VERW, i.e. when user state is
     restored and most interesting data is already scrubbed. What's left
     is only the data that the NMI touches, and that may or may not be
     of any interest.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/entry/entry_64.S        | 11 +++++++++++
 arch/x86/entry/entry_64_compat.S |  1 +
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 43606de22511..9f97a8bd11e8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -223,6 +223,7 @@ syscall_return_via_sysret:
 SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack, SYM_L_GLOBAL)
 	ANNOTATE_NOENDBR
 	swapgs
+	CLEAR_CPU_BUFFERS
 	sysretq
 SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
 	ANNOTATE_NOENDBR
@@ -663,6 +664,7 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 	/* Restore RDI. */
 	popq	%rdi
 	swapgs
+	CLEAR_CPU_BUFFERS
 	jmp	.Lnative_iret
 
 
@@ -774,6 +776,8 @@ native_irq_return_ldt:
 	 */
 	popq	%rax				/* Restore user RAX */
 
+	CLEAR_CPU_BUFFERS
+
 	/*
 	 * RSP now points to an ordinary IRET frame, except that the page
 	 * is read-only and RSP[31:16] are preloaded with the userspace
@@ -1502,6 +1506,12 @@ nmi_restore:
 	std
 	movq	$0, 5*8(%rsp)		/* clear "NMI executing" */
 
+	/*
+	 * Skip CLEAR_CPU_BUFFERS here, since it only helps in rare cases like
+	 * NMI in kernel after user state is restored. For an unprivileged user
+	 * these conditions are hard to meet.
+	 */
+
 	/*
 	 * iretq reads the "iret" frame and exits the NMI stack in a
 	 * single instruction.  We are returning to kernel mode, so this
@@ -1520,6 +1530,7 @@ SYM_CODE_START(ignore_sysret)
 	UNWIND_HINT_END_OF_STACK
 	ENDBR
 	mov	$-ENOSYS, %eax
+	CLEAR_CPU_BUFFERS
 	sysretl
 SYM_CODE_END(ignore_sysret)
 #endif
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 70150298f8bd..245697eb8485 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -271,6 +271,7 @@ SYM_INNER_LABEL(entry_SYSRETL_compat_unsafe_stack, SYM_L_GLOBAL)
 	xorl	%r9d, %r9d
 	xorl	%r10d, %r10d
 	swapgs
+	CLEAR_CPU_BUFFERS
 	sysretl
 SYM_INNER_LABEL(entry_SYSRETL_compat_end, SYM_L_GLOBAL)
 	ANNOTATE_NOENDBR

-- 
2.34.1




* [PATCH  v3 3/6] x86/entry_32: Add VERW just before userspace transition
  2023-10-25 20:52 [PATCH v3 0/6] Delay VERW Pawan Gupta
  2023-10-25 20:52 ` [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW Pawan Gupta
  2023-10-25 20:52 ` [PATCH v3 2/6] x86/entry_64: Add VERW just before userspace transition Pawan Gupta
@ 2023-10-25 20:53 ` Pawan Gupta
  2023-10-25 20:53 ` [PATCH v3 4/6] x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key Pawan Gupta
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Pawan Gupta @ 2023-10-25 20:53 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen
  Cc: linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Pawan Gupta

As done for entry_64, add support for executing VERW late in the
exit-to-user path for 32-bit mode.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/entry/entry_32.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 6e6af42e044a..74a4358c7f45 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -885,6 +885,7 @@ SYM_FUNC_START(entry_SYSENTER_32)
 	BUG_IF_WRONG_CR3 no_user_check=1
 	popfl
 	popl	%eax
+	CLEAR_CPU_BUFFERS
 
 	/*
 	 * Return back to the vDSO, which will pop ecx and edx.
@@ -954,6 +955,7 @@ restore_all_switch_stack:
 
 	/* Restore user state */
 	RESTORE_REGS pop=4			# skip orig_eax/error_code
+	CLEAR_CPU_BUFFERS
 .Lirq_return:
 	/*
 	 * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
@@ -1146,6 +1148,7 @@ SYM_CODE_START(asm_exc_nmi)
 
 	/* Not on SYSENTER stack. */
 	call	exc_nmi
+	CLEAR_CPU_BUFFERS
 	jmp	.Lnmi_return
 
 .Lnmi_from_sysenter_stack:

-- 
2.34.1




* [PATCH  v3 4/6] x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
  2023-10-25 20:52 [PATCH v3 0/6] Delay VERW Pawan Gupta
                   ` (2 preceding siblings ...)
  2023-10-25 20:53 ` [PATCH v3 3/6] x86/entry_32: " Pawan Gupta
@ 2023-10-25 20:53 ` Pawan Gupta
  2023-10-25 20:53 ` [PATCH v3 5/6] KVM: VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH Pawan Gupta
  2023-10-25 20:53 ` [PATCH v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation Pawan Gupta
  5 siblings, 0 replies; 32+ messages in thread
From: Pawan Gupta @ 2023-10-25 20:53 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen
  Cc: linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Pawan Gupta

The VERW mitigation at exit-to-user is enabled via the static branch
mds_user_clear. This static branch is never toggled after boot, and can
be safely replaced with an ALTERNATIVE(), which is convenient to use in
asm.

Switch to ALTERNATIVE() to use the VERW mitigation late in the
exit-to-user path. Also remove the now-redundant VERW in exc_nmi() and
arch_exit_to_user_mode().

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 Documentation/arch/x86/mds.rst       | 39 ++++++++++++++++++++++++++----------
 arch/x86/include/asm/entry-common.h  |  1 -
 arch/x86/include/asm/nospec-branch.h | 12 -----------
 arch/x86/kernel/cpu/bugs.c           | 15 ++++++--------
 arch/x86/kernel/nmi.c                |  2 --
 arch/x86/kvm/vmx/vmx.c               |  2 +-
 6 files changed, 35 insertions(+), 36 deletions(-)

diff --git a/Documentation/arch/x86/mds.rst b/Documentation/arch/x86/mds.rst
index e73fdff62c0a..34b9e476078c 100644
--- a/Documentation/arch/x86/mds.rst
+++ b/Documentation/arch/x86/mds.rst
@@ -95,6 +95,9 @@ The kernel provides a function to invoke the buffer clearing:
 
     mds_clear_cpu_buffers()
 
+Also macro CLEAR_CPU_BUFFERS is meant to be used in ASM late in exit-to-user
+path. This macro works for cases where GPRs can't be clobbered.
+
 The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
 (idle) transitions.
 
@@ -138,17 +141,31 @@ Mitigation points
 
    When transitioning from kernel to user space the CPU buffers are flushed
    on affected CPUs when the mitigation is not disabled on the kernel
-   command line. The migitation is enabled through the static key
-   mds_user_clear.
-
-   The mitigation is invoked in prepare_exit_to_usermode() which covers
-   all but one of the kernel to user space transitions.  The exception
-   is when we return from a Non Maskable Interrupt (NMI), which is
-   handled directly in do_nmi().
-
-   (The reason that NMI is special is that prepare_exit_to_usermode() can
-    enable IRQs.  In NMI context, NMIs are blocked, and we don't want to
-    enable IRQs with NMIs blocked.)
+   command line. The mitigation is enabled through the feature flag
+   X86_FEATURE_CLEAR_CPU_BUF.
+
+   The mitigation is invoked just before transitioning to userspace after
+   user registers are restored. This is done to minimize the window in
+   which kernel data could be accessed after VERW e.g. via an NMI after
+   VERW.
+
+   Corner case not handled
+   ^^^^^^^^^^^^^^^^^^^^^^^
+   Interrupts returning to kernel don't clear CPUs buffers since the
+   exit-to-user path is expected to do that anyways. But, there could be
+   a case when an NMI is generated in kernel after the exit-to-user path
+   has cleared the buffers. This case is not handled and NMI returning to
+   kernel don't clear CPU buffers because:
+
+   1. It is rare to get an NMI after VERW, but before returning to userspace.
+   2. For an unprivileged user, there is no known way to make that NMI
+      less rare or target it.
+   3. It would take a large number of these precisely-timed NMIs to mount
+      an actual attack.  There's presumably not enough bandwidth.
+   4. The NMI in question occurs after a VERW, i.e. when user state is
+      restored and most interesting data is already scrubbed. Whats left
+      is only the data that NMI touches, and that may or may not be of
+      any interest.
 
 
 2. C-State transition
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index ce8f50192ae3..7e523bb3d2d3 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -91,7 +91,6 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
 
 static __always_inline void arch_exit_to_user_mode(void)
 {
-	mds_user_clear_cpu_buffers();
 	amd_clear_divider();
 }
 #define arch_exit_to_user_mode arch_exit_to_user_mode
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 005e69f93115..12b8e86678bf 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -553,7 +553,6 @@ DECLARE_STATIC_KEY_FALSE(switch_to_cond_stibp);
 DECLARE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
 DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
 
-DECLARE_STATIC_KEY_FALSE(mds_user_clear);
 DECLARE_STATIC_KEY_FALSE(mds_idle_clear);
 
 DECLARE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
@@ -585,17 +584,6 @@ static __always_inline void mds_clear_cpu_buffers(void)
 	asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
 }
 
-/**
- * mds_user_clear_cpu_buffers - Mitigation for MDS and TAA vulnerability
- *
- * Clear CPU buffers if the corresponding static key is enabled
- */
-static __always_inline void mds_user_clear_cpu_buffers(void)
-{
-	if (static_branch_likely(&mds_user_clear))
-		mds_clear_cpu_buffers();
-}
-
 /**
  * mds_idle_clear_cpu_buffers - Mitigation for MDS vulnerability
  *
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 10499bcd4e39..00aab0c0937f 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -111,9 +111,6 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
 /* Control unconditional IBPB in switch_mm() */
 DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
 
-/* Control MDS CPU buffer clear before returning to user space */
-DEFINE_STATIC_KEY_FALSE(mds_user_clear);
-EXPORT_SYMBOL_GPL(mds_user_clear);
 /* Control MDS CPU buffer clear before idling (halt, mwait) */
 DEFINE_STATIC_KEY_FALSE(mds_idle_clear);
 EXPORT_SYMBOL_GPL(mds_idle_clear);
@@ -252,7 +249,7 @@ static void __init mds_select_mitigation(void)
 		if (!boot_cpu_has(X86_FEATURE_MD_CLEAR))
 			mds_mitigation = MDS_MITIGATION_VMWERV;
 
-		static_branch_enable(&mds_user_clear);
+		setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF);
 
 		if (!boot_cpu_has(X86_BUG_MSBDS_ONLY) &&
 		    (mds_nosmt || cpu_mitigations_auto_nosmt()))
@@ -356,7 +353,7 @@ static void __init taa_select_mitigation(void)
 	 * For guests that can't determine whether the correct microcode is
 	 * present on host, enable the mitigation for UCODE_NEEDED as well.
 	 */
-	static_branch_enable(&mds_user_clear);
+	setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF);
 
 	if (taa_nosmt || cpu_mitigations_auto_nosmt())
 		cpu_smt_disable(false);
@@ -424,7 +421,7 @@ static void __init mmio_select_mitigation(void)
 	 */
 	if (boot_cpu_has_bug(X86_BUG_MDS) || (boot_cpu_has_bug(X86_BUG_TAA) &&
 					      boot_cpu_has(X86_FEATURE_RTM)))
-		static_branch_enable(&mds_user_clear);
+		setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF);
 	else
 		static_branch_enable(&mmio_stale_data_clear);
 
@@ -484,12 +481,12 @@ static void __init md_clear_update_mitigation(void)
 	if (cpu_mitigations_off())
 		return;
 
-	if (!static_key_enabled(&mds_user_clear))
+	if (!boot_cpu_has(X86_FEATURE_CLEAR_CPU_BUF))
 		goto out;
 
 	/*
-	 * mds_user_clear is now enabled. Update MDS, TAA and MMIO Stale Data
-	 * mitigation, if necessary.
+	 * X86_FEATURE_CLEAR_CPU_BUF is now enabled. Update MDS, TAA and MMIO
+	 * Stale Data mitigation, if necessary.
 	 */
 	if (mds_mitigation == MDS_MITIGATION_OFF &&
 	    boot_cpu_has_bug(X86_BUG_MDS)) {
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index a0c551846b35..ebfff8dca661 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -551,8 +551,6 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 	if (this_cpu_dec_return(nmi_state))
 		goto nmi_restart;
 
-	if (user_mode(regs))
-		mds_user_clear_cpu_buffers();
 	if (IS_ENABLED(CONFIG_NMI_CHECK_CPU)) {
 		WRITE_ONCE(nsp->idt_seq, nsp->idt_seq + 1);
 		WARN_ON_ONCE(nsp->idt_seq & 0x1);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 72e3943f3693..24e8694b83fc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7229,7 +7229,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	/* L1D Flush includes CPU buffer clear to mitigate MDS */
 	if (static_branch_unlikely(&vmx_l1d_should_flush))
 		vmx_l1d_flush(vcpu);
-	else if (static_branch_unlikely(&mds_user_clear))
+	else if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
 		mds_clear_cpu_buffers();
 	else if (static_branch_unlikely(&mmio_stale_data_clear) &&
 		 kvm_arch_has_assigned_device(vcpu->kvm))

-- 
2.34.1




* [PATCH  v3 5/6] KVM: VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH
  2023-10-25 20:52 [PATCH v3 0/6] Delay VERW Pawan Gupta
                   ` (3 preceding siblings ...)
  2023-10-25 20:53 ` [PATCH v3 4/6] x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key Pawan Gupta
@ 2023-10-25 20:53 ` Pawan Gupta
  2023-10-25 20:53 ` [PATCH v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation Pawan Gupta
  5 siblings, 0 replies; 32+ messages in thread
From: Pawan Gupta @ 2023-10-25 20:53 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen
  Cc: linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Pawan Gupta, Nikolay Borisov

From: Sean Christopherson <seanjc@google.com>

Use EFLAGS.CF instead of EFLAGS.ZF to track whether to use VMRESUME versus
VMLAUNCH.  Freeing up EFLAGS.ZF will allow doing VERW, which clobbers ZF,
for MDS mitigations as late as possible without needing to duplicate VERW
for both paths.

Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/kvm/vmx/run_flags.h | 7 +++++--
 arch/x86/kvm/vmx/vmenter.S   | 6 +++---
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/run_flags.h b/arch/x86/kvm/vmx/run_flags.h
index edc3f16cc189..6a9bfdfbb6e5 100644
--- a/arch/x86/kvm/vmx/run_flags.h
+++ b/arch/x86/kvm/vmx/run_flags.h
@@ -2,7 +2,10 @@
 #ifndef __KVM_X86_VMX_RUN_FLAGS_H
 #define __KVM_X86_VMX_RUN_FLAGS_H
 
-#define VMX_RUN_VMRESUME	(1 << 0)
-#define VMX_RUN_SAVE_SPEC_CTRL	(1 << 1)
+#define VMX_RUN_VMRESUME_SHIFT		0
+#define VMX_RUN_SAVE_SPEC_CTRL_SHIFT	1
+
+#define VMX_RUN_VMRESUME		BIT(VMX_RUN_VMRESUME_SHIFT)
+#define VMX_RUN_SAVE_SPEC_CTRL		BIT(VMX_RUN_SAVE_SPEC_CTRL_SHIFT)
 
 #endif /* __KVM_X86_VMX_RUN_FLAGS_H */
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index be275a0410a8..b3b13ec04bac 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -139,7 +139,7 @@ SYM_FUNC_START(__vmx_vcpu_run)
 	mov (%_ASM_SP), %_ASM_AX
 
 	/* Check if vmlaunch or vmresume is needed */
-	test $VMX_RUN_VMRESUME, %ebx
+	bt   $VMX_RUN_VMRESUME_SHIFT, %ebx
 
 	/* Load guest registers.  Don't clobber flags. */
 	mov VCPU_RCX(%_ASM_AX), %_ASM_CX
@@ -161,8 +161,8 @@ SYM_FUNC_START(__vmx_vcpu_run)
 	/* Load guest RAX.  This kills the @regs pointer! */
 	mov VCPU_RAX(%_ASM_AX), %_ASM_AX
 
-	/* Check EFLAGS.ZF from 'test VMX_RUN_VMRESUME' above */
-	jz .Lvmlaunch
+	/* Check EFLAGS.CF from the VMX_RUN_VMRESUME bit test above. */
+	jnc .Lvmlaunch
 
 	/*
 	 * After a successful VMRESUME/VMLAUNCH, control flow "magically"

-- 
2.34.1




* [PATCH  v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation
  2023-10-25 20:52 [PATCH v3 0/6] Delay VERW Pawan Gupta
                   ` (4 preceding siblings ...)
  2023-10-25 20:53 ` [PATCH v3 5/6] KVM: VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH Pawan Gupta
@ 2023-10-25 20:53 ` Pawan Gupta
  2023-10-26 16:14   ` Nikolay Borisov
  2023-10-26 19:30   ` Sean Christopherson
  5 siblings, 2 replies; 32+ messages in thread
From: Pawan Gupta @ 2023-10-25 20:53 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen
  Cc: linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Pawan Gupta

During VMentry, VERW is executed to mitigate MDS. After VERW, any memory
access like a register push onto the stack may put host data in
MDS-affected CPU buffers. A guest can then use MDS to sample host data.

Although the likelihood of secrets surviving in registers at the current
VERW callsite is low, it can't be ruled out. Harden the MDS mitigation
by moving VERW late in the VMentry path.

Note that VERW for the MMIO Stale Data mitigation is unchanged because
of the complexity of per-guest conditional VERW, which is not easy to
handle that late in asm with no GPRs available. If the CPU is also
affected by MDS, VERW is unconditionally executed late in asm regardless
of the guest having MMIO access.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/kvm/vmx/vmenter.S |  3 +++
 arch/x86/kvm/vmx/vmx.c     | 10 +++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index b3b13ec04bac..139960deb736 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -161,6 +161,9 @@ SYM_FUNC_START(__vmx_vcpu_run)
 	/* Load guest RAX.  This kills the @regs pointer! */
 	mov VCPU_RAX(%_ASM_AX), %_ASM_AX
 
+	/* Clobbers EFLAGS.ZF */
+	CLEAR_CPU_BUFFERS
+
 	/* Check EFLAGS.CF from the VMX_RUN_VMRESUME bit test above. */
 	jnc .Lvmlaunch
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 24e8694b83fc..2d149589cf5b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7226,13 +7226,17 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 
 	guest_state_enter_irqoff();
 
-	/* L1D Flush includes CPU buffer clear to mitigate MDS */
+	/*
+	 * L1D Flush includes CPU buffer clear to mitigate MDS, but VERW
+	 * mitigation for MDS is done late in VMentry and is still
+	 * executed inspite of L1D Flush. This is because an extra VERW
+	 * should not matter much after the big hammer L1D Flush.
+	 */
 	if (static_branch_unlikely(&vmx_l1d_should_flush))
 		vmx_l1d_flush(vcpu);
-	else if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
-		mds_clear_cpu_buffers();
 	else if (static_branch_unlikely(&mmio_stale_data_clear) &&
 		 kvm_arch_has_assigned_device(vcpu->kvm))
+		/* MMIO mitigation is mutually exclusive with MDS mitigation later in asm */
 		mds_clear_cpu_buffers();
 
 	vmx_disable_fb_clear(vmx);

-- 
2.34.1




* Re: [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW
  2023-10-25 20:52 ` [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW Pawan Gupta
@ 2023-10-25 21:10   ` Andrew Cooper
  2023-10-25 21:28     ` Josh Poimboeuf
  2023-10-25 22:07     ` Pawan Gupta
  2023-10-26 13:44   ` Nikolay Borisov
  1 sibling, 2 replies; 32+ messages in thread
From: Andrew Cooper @ 2023-10-25 21:10 UTC (permalink / raw)
  To: Pawan Gupta, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf,
	Andy Lutomirski, Jonathan Corbet, Sean Christopherson,
	Paolo Bonzini, tony.luck, ak, tim.c.chen
  Cc: linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Alyssa Milburn

On 25/10/2023 9:52 pm, Pawan Gupta wrote:
> diff --git a/arch/x86/entry/entry.S b/arch/x86/entry/entry.S
> index bfb7bcb362bc..f8ba0c0b6e60 100644
> --- a/arch/x86/entry/entry.S
> +++ b/arch/x86/entry/entry.S
> @@ -20,3 +23,16 @@ SYM_FUNC_END(entry_ibpb)
>  EXPORT_SYMBOL_GPL(entry_ibpb);
>  
>  .popsection
> +
> +.pushsection .entry.text, "ax"
> +
> +.align L1_CACHE_BYTES, 0xcc
> +SYM_CODE_START_NOALIGN(mds_verw_sel)
> +	UNWIND_HINT_UNDEFINED
> +	ANNOTATE_NOENDBR
> +	.word __KERNEL_DS

You need another .align here.  Otherwise subsequent code will still
start in this cacheline and defeat the purpose of trying to keep it
separate.

> +SYM_CODE_END(mds_verw_sel);

Thinking about it, should this really be CODE and not a data entry?

It lives in .entry.text but it really is data and objtool shouldn't be
writing ORC data for it at all.

(Not to mention that if it's marked as STT_OBJECT, objdump -d will do
the sensible thing and not even try to disassemble it).

~Andrew

P.S. Please CC on the full series.  Far less effort than fishing the
rest off lore.


* Re: [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW
  2023-10-25 21:10   ` Andrew Cooper
@ 2023-10-25 21:28     ` Josh Poimboeuf
  2023-10-25 21:30       ` Andrew Cooper
  2023-10-25 22:07     ` Pawan Gupta
  1 sibling, 1 reply; 32+ messages in thread
From: Josh Poimboeuf @ 2023-10-25 21:28 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Pawan Gupta, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra,
	Andy Lutomirski, Jonathan Corbet, Sean Christopherson,
	Paolo Bonzini, tony.luck, ak, tim.c.chen, linux-kernel,
	linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Alyssa Milburn

On Wed, Oct 25, 2023 at 10:10:41PM +0100, Andrew Cooper wrote:
> On 25/10/2023 9:52 pm, Pawan Gupta wrote:
> > diff --git a/arch/x86/entry/entry.S b/arch/x86/entry/entry.S
> > index bfb7bcb362bc..f8ba0c0b6e60 100644
> > --- a/arch/x86/entry/entry.S
> > +++ b/arch/x86/entry/entry.S
> > @@ -20,3 +23,16 @@ SYM_FUNC_END(entry_ibpb)
> >  EXPORT_SYMBOL_GPL(entry_ibpb);
> >  
> >  .popsection
> > +
> > +.pushsection .entry.text, "ax"
> > +
> > +.align L1_CACHE_BYTES, 0xcc
> > +SYM_CODE_START_NOALIGN(mds_verw_sel)
> > +	UNWIND_HINT_UNDEFINED
> > +	ANNOTATE_NOENDBR
> > +	.word __KERNEL_DS
> 
> You need another .align here.  Otherwise subsequent code will still
> start in this cacheline and defeat the purpose of trying to keep it
> separate.
> 
> > +SYM_CODE_END(mds_verw_sel);
> 
> Thinking about it, should this really be CODE and not a data entry?
> 
> It lives in .entry.text but it really is data and objtool shouldn't be
> writing ORC data for it at all.
> 
> (Not to mention that if it's marked as STT_OBJECT, objdump -d will do
> the sensible thing and not even try to disassemble it).
> 
> ~Andrew
> 
> P.S. Please CC on the full series.  Far less effort than fishing the
> rest off lore.

+1 to putting it in .rodata or so.

-- 
Josh


* Re: [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW
  2023-10-25 21:28     ` Josh Poimboeuf
@ 2023-10-25 21:30       ` Andrew Cooper
  2023-10-25 21:49         ` Josh Poimboeuf
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Cooper @ 2023-10-25 21:30 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Pawan Gupta, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra,
	Andy Lutomirski, Jonathan Corbet, Sean Christopherson,
	Paolo Bonzini, tony.luck, ak, tim.c.chen, linux-kernel,
	linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Alyssa Milburn

On 25/10/2023 10:28 pm, Josh Poimboeuf wrote:
> On Wed, Oct 25, 2023 at 10:10:41PM +0100, Andrew Cooper wrote:
>> On 25/10/2023 9:52 pm, Pawan Gupta wrote:
>>> diff --git a/arch/x86/entry/entry.S b/arch/x86/entry/entry.S
>>> index bfb7bcb362bc..f8ba0c0b6e60 100644
>>> --- a/arch/x86/entry/entry.S
>>> +++ b/arch/x86/entry/entry.S
>>> @@ -20,3 +23,16 @@ SYM_FUNC_END(entry_ibpb)
>>>  EXPORT_SYMBOL_GPL(entry_ibpb);
>>>  
>>>  .popsection
>>> +
>>> +.pushsection .entry.text, "ax"
>>> +
>>> +.align L1_CACHE_BYTES, 0xcc
>>> +SYM_CODE_START_NOALIGN(mds_verw_sel)
>>> +	UNWIND_HINT_UNDEFINED
>>> +	ANNOTATE_NOENDBR
>>> +	.word __KERNEL_DS
>> You need another .align here.  Otherwise subsequent code will still
>> start in this cacheline and defeat the purpose of trying to keep it
>> separate.
>>
>>> +SYM_CODE_END(mds_verw_sel);
>> Thinking about it, should this really be CODE and not a data entry?
>>
>> It lives in .entry.text but it really is data and objtool shouldn't be
>> writing ORC data for it at all.
>>
>> (Not to mention that if it's marked as STT_OBJECT, objdump -d will do
>> the sensible thing and not even try to disassemble it).
>>
>> ~Andrew
>>
>> P.S. Please CC on the full series.  Far less effort than fishing the
>> rest off lore.
> +1 to putting it in .rodata or so.

It's necessarily in .entry.text so it doesn't explode with KPTI active.

~Andrew


* Re: [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW
  2023-10-25 21:30       ` Andrew Cooper
@ 2023-10-25 21:49         ` Josh Poimboeuf
  0 siblings, 0 replies; 32+ messages in thread
From: Josh Poimboeuf @ 2023-10-25 21:49 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Pawan Gupta, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra,
	Andy Lutomirski, Jonathan Corbet, Sean Christopherson,
	Paolo Bonzini, tony.luck, ak, tim.c.chen, linux-kernel,
	linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Alyssa Milburn

On Wed, Oct 25, 2023 at 10:30:52PM +0100, Andrew Cooper wrote:
> On 25/10/2023 10:28 pm, Josh Poimboeuf wrote:
> > On Wed, Oct 25, 2023 at 10:10:41PM +0100, Andrew Cooper wrote:
> >> On 25/10/2023 9:52 pm, Pawan Gupta wrote:
> >>> diff --git a/arch/x86/entry/entry.S b/arch/x86/entry/entry.S
> >>> index bfb7bcb362bc..f8ba0c0b6e60 100644
> >>> --- a/arch/x86/entry/entry.S
> >>> +++ b/arch/x86/entry/entry.S
> >>> @@ -20,3 +23,16 @@ SYM_FUNC_END(entry_ibpb)
> >>>  EXPORT_SYMBOL_GPL(entry_ibpb);
> >>>  
> >>>  .popsection
> >>> +
> >>> +.pushsection .entry.text, "ax"
> >>> +
> >>> +.align L1_CACHE_BYTES, 0xcc
> >>> +SYM_CODE_START_NOALIGN(mds_verw_sel)
> >>> +	UNWIND_HINT_UNDEFINED
> >>> +	ANNOTATE_NOENDBR
> >>> +	.word __KERNEL_DS
> >> You need another .align here.  Otherwise subsequent code will still
> >> start in this cacheline and defeat the purpose of trying to keep it
> >> separate.
> >>
> >>> +SYM_CODE_END(mds_verw_sel);
> >> Thinking about it, should this really be CODE and not a data entry?
> >>
> >> It lives in .entry.text but it really is data and objtool shouldn't be
> >> writing ORC data for it at all.
> >>
> >> (Not to mention that if it's marked as STT_OBJECT, objdump -d will do
> >> the sensible thing and not even try to disassemble it).
> >>
> >> ~Andrew
> >>
> >> P.S. Please CC on the full series.  Far less effort than fishing the
> >> rest off lore.
> > +1 to putting it in .rodata or so.
> 
> It's necessarily in .entry.text so it doesn't explode with KPTI active.

Ah, right.  In general tooling doesn't take too kindly to putting data
in a text section.  But it might be ok.

-- 
Josh


* Re: [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW
  2023-10-25 21:10   ` Andrew Cooper
  2023-10-25 21:28     ` Josh Poimboeuf
@ 2023-10-25 22:07     ` Pawan Gupta
  2023-10-25 22:13       ` Andrew Cooper
  1 sibling, 1 reply; 32+ messages in thread
From: Pawan Gupta @ 2023-10-25 22:07 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen, linux-kernel, linux-doc, kvm, Alyssa Milburn,
	Daniel Sneddon, antonio.gomez.iglesias, Alyssa Milburn

On Wed, Oct 25, 2023 at 10:10:41PM +0100, Andrew Cooper wrote:
> > +.align L1_CACHE_BYTES, 0xcc
> > +SYM_CODE_START_NOALIGN(mds_verw_sel)
> > +	UNWIND_HINT_UNDEFINED
> > +	ANNOTATE_NOENDBR
> > +	.word __KERNEL_DS
> 
> You need another .align here.  Otherwise subsequent code will still
> start in this cacheline and defeat the purpose of trying to keep it
> separate.

Right.

> > +SYM_CODE_END(mds_verw_sel);
> 
> Thinking about it, should this really be CODE and not a data entry?

Would that require adding a data equivalent of .entry.text and update
KPTI to keep it mapped? Or is there an easier option?

> P.S. Please CC on the full series.  Far less effort than fishing the
> rest off lore.

I didn't realize get_maintainer.pl isn't doing that already. Proposing
below update to MAINTAINERS:

---
From: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Date: Wed, 25 Oct 2023 14:50:41 -0700
Subject: [PATCH] MAINTAINERS: Update entry for X86 HARDWARE VULNERABILITIES

Add Andrew Cooper to maintainers of hardware vulnerabilities
mitigations.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2894f0777537..bf8c8707b8f8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23382,6 +23382,7 @@ M:	Thomas Gleixner <tglx@linutronix.de>
 M:	Borislav Petkov <bp@alien8.de>
 M:	Peter Zijlstra <peterz@infradead.org>
 M:	Josh Poimboeuf <jpoimboe@kernel.org>
+M:	Andrew Cooper <andrew.cooper3@citrix.com>
 R:	Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
 S:	Maintained
 F:	Documentation/admin-guide/hw-vuln/
-- 
2.34.1



* Re: [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW
  2023-10-25 22:07     ` Pawan Gupta
@ 2023-10-25 22:13       ` Andrew Cooper
  2023-10-27 13:48         ` Pawan Gupta
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Cooper @ 2023-10-25 22:13 UTC (permalink / raw)
  To: Pawan Gupta
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen, linux-kernel, linux-doc, kvm, Alyssa Milburn,
	Daniel Sneddon, antonio.gomez.iglesias, Alyssa Milburn

On 25/10/2023 11:07 pm, Pawan Gupta wrote:
> On Wed, Oct 25, 2023 at 10:10:41PM +0100, Andrew Cooper wrote:
>>> +.align L1_CACHE_BYTES, 0xcc
>>> +SYM_CODE_START_NOALIGN(mds_verw_sel)
>>> +	UNWIND_HINT_UNDEFINED
>>> +	ANNOTATE_NOENDBR
>>> +	.word __KERNEL_DS
>> You need another .align here.  Otherwise subsequent code will still
>> start in this cacheline and defeat the purpose of trying to keep it
>> separate.
> Right.
>
>>> +SYM_CODE_END(mds_verw_sel);
>> Thinking about it, should this really be CODE and not a data entry?
> Would that require adding a data equivalent of .entry.text and update
> KPTI to keep it mapped? Or is there an easier option?

Leave it right here in .entry.text , but try using SYM_DATA() and
friends.  See whether objtool vomits over the result or not.

And if objtool does vomit over the result, then leaving it as it is in
this patch with SYM_CODE() is good enough.

>
>> P.S. Please CC on the full series.  Far less effort than fishing the
>> rest off lore.
> I didn't realize get_maintainer.pl isn't doing that already. Proposing
> below update to MAINTAINERS:
>
> ---
> From: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> Date: Wed, 25 Oct 2023 14:50:41 -0700
> Subject: [PATCH] MAINTAINERS: Update entry for X86 HARDWARE VULNERABILITIES
>
> Add Andrew Cooper to maintainers of hardware vulnerabilities
> mitigations.
>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> ---
>  MAINTAINERS | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2894f0777537..bf8c8707b8f8 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -23382,6 +23382,7 @@ M:	Thomas Gleixner <tglx@linutronix.de>
>  M:	Borislav Petkov <bp@alien8.de>
>  M:	Peter Zijlstra <peterz@infradead.org>
>  M:	Josh Poimboeuf <jpoimboe@kernel.org>
> +M:	Andrew Cooper <andrew.cooper3@citrix.com>

Oh, right.  Perhaps R rather than M seeing as I can't make any time
commitments, but sure.

~Andrew


* Re: [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW
  2023-10-25 20:52 ` [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW Pawan Gupta
  2023-10-25 21:10   ` Andrew Cooper
@ 2023-10-26 13:44   ` Nikolay Borisov
  2023-10-26 13:58     ` Andrew Cooper
  1 sibling, 1 reply; 32+ messages in thread
From: Nikolay Borisov @ 2023-10-26 13:44 UTC (permalink / raw)
  To: Pawan Gupta, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf,
	Andy Lutomirski, Jonathan Corbet, Sean Christopherson,
	Paolo Bonzini, tony.luck, ak, tim.c.chen
  Cc: linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Alyssa Milburn, Andrew Cooper



<snip>
> +
> +.pushsection .entry.text, "ax"
> +
> +.align L1_CACHE_BYTES, 0xcc
> +SYM_CODE_START_NOALIGN(mds_verw_sel)
> +	UNWIND_HINT_UNDEFINED
> +	ANNOTATE_NOENDBR
> +	.word __KERNEL_DS
> +SYM_CODE_END(mds_verw_sel);
> +/* For KVM */
> +EXPORT_SYMBOL_GPL(mds_verw_sel);
> +
> +.popsection

<snip>

> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index c55cc243592e..005e69f93115 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -329,6 +329,21 @@
>   #endif
>   .endm
>   
> +/*
> + * Macros to execute VERW instruction that mitigate transient data sampling
> + * attacks such as MDS. On affected systems a microcode update overloaded VERW
> + * instruction to also clear the CPU buffers. VERW clobbers CFLAGS.ZF.
> + *
> + * Note: Only the memory operand variant of VERW clears the CPU buffers.
> + */
> +.macro EXEC_VERW
> +	verw _ASM_RIP(mds_verw_sel)
> +.endm
> +
> +.macro CLEAR_CPU_BUFFERS
> +	ALTERNATIVE "", __stringify(EXEC_VERW), X86_FEATURE_CLEAR_CPU_BUF
> +.endm


What happened with the first 5 bytes of a 7 byte nop being complemented 
by __KERNEL_DS in order to handle VERW being executed after user 
registers are restored and having its memory operand ?

> +
>   #else /* __ASSEMBLY__ */
>   
>   #define ANNOTATE_RETPOLINE_SAFE					\
> 


* Re: [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW
  2023-10-26 13:44   ` Nikolay Borisov
@ 2023-10-26 13:58     ` Andrew Cooper
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Cooper @ 2023-10-26 13:58 UTC (permalink / raw)
  To: Nikolay Borisov, Pawan Gupta, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski, Jonathan Corbet,
	Sean Christopherson, Paolo Bonzini, tony.luck, ak, tim.c.chen
  Cc: linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Alyssa Milburn

On 26/10/2023 2:44 pm, Nikolay Borisov wrote:
>
>
> <snip>
>> +
>> +.pushsection .entry.text, "ax"
>> +
>> +.align L1_CACHE_BYTES, 0xcc
>> +SYM_CODE_START_NOALIGN(mds_verw_sel)
>> +    UNWIND_HINT_UNDEFINED
>> +    ANNOTATE_NOENDBR
>> +    .word __KERNEL_DS
>> +SYM_CODE_END(mds_verw_sel);
>> +/* For KVM */
>> +EXPORT_SYMBOL_GPL(mds_verw_sel);
>> +
>> +.popsection
>
> <snip>
>
>> diff --git a/arch/x86/include/asm/nospec-branch.h
>> b/arch/x86/include/asm/nospec-branch.h
>> index c55cc243592e..005e69f93115 100644
>> --- a/arch/x86/include/asm/nospec-branch.h
>> +++ b/arch/x86/include/asm/nospec-branch.h
>> @@ -329,6 +329,21 @@
>>   #endif
>>   .endm
>>   +/*
>> + * Macros to execute VERW instruction that mitigate transient data
>> sampling
>> + * attacks such as MDS. On affected systems a microcode update
>> overloaded VERW
>> + * instruction to also clear the CPU buffers. VERW clobbers CFLAGS.ZF.
>> + *
>> + * Note: Only the memory operand variant of VERW clears the CPU
>> buffers.
>> + */
>> +.macro EXEC_VERW
>> +    verw _ASM_RIP(mds_verw_sel)
>> +.endm
>> +
>> +.macro CLEAR_CPU_BUFFERS
>> +    ALTERNATIVE "", __stringify(EXEC_VERW), X86_FEATURE_CLEAR_CPU_BUF
>> +.endm
>
>
> What happened with the first 5 bytes of a 7 byte nop being
> complemented by __KERNEL_DS in order to handle VERW being executed
> after user registers are restored and having its memory operand ?

It was moved out of line (so no need to hide a constant in a nop),
deduped, and renamed to mds_verw_sel.

verw _ASM_RIP(mds_verw_sel) *is* the memory form.

~Andrew


* Re: [PATCH v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation
  2023-10-25 20:53 ` [PATCH v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation Pawan Gupta
@ 2023-10-26 16:14   ` Nikolay Borisov
  2023-10-26 19:07     ` Pawan Gupta
  2023-10-26 19:30   ` Sean Christopherson
  1 sibling, 1 reply; 32+ messages in thread
From: Nikolay Borisov @ 2023-10-26 16:14 UTC (permalink / raw)
  To: Pawan Gupta, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf,
	Andy Lutomirski, Jonathan Corbet, Sean Christopherson,
	Paolo Bonzini, tony.luck, ak, tim.c.chen
  Cc: linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias



On 25.10.23 г. 23:53 ч., Pawan Gupta wrote:
> During VMentry, VERW is executed to mitigate MDS. After VERW, any memory
> access like a register push onto the stack may put host data in
> MDS-affected CPU buffers. A guest can then use MDS to sample host data.
> 
> Although the likelihood of secrets surviving in registers at the current
> VERW callsite is low, it can't be ruled out. Harden the MDS mitigation
> by moving VERW late in the VMentry path.
> 
> Note that VERW for the MMIO Stale Data mitigation is unchanged because
> of the complexity of per-guest conditional VERW, which is not easy to
> handle that late in asm with no GPRs available. If the CPU is also
> affected by MDS, VERW is unconditionally executed late in asm regardless
> of the guest having MMIO access.
> 
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> ---
>   arch/x86/kvm/vmx/vmenter.S |  3 +++
>   arch/x86/kvm/vmx/vmx.c     | 10 +++++++---
>   2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> index b3b13ec04bac..139960deb736 100644
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -161,6 +161,9 @@ SYM_FUNC_START(__vmx_vcpu_run)
>   	/* Load guest RAX.  This kills the @regs pointer! */
>   	mov VCPU_RAX(%_ASM_AX), %_ASM_AX
>   
> +	/* Clobbers EFLAGS.ZF */
> +	CLEAR_CPU_BUFFERS
> +
>   	/* Check EFLAGS.CF from the VMX_RUN_VMRESUME bit test above. */
>   	jnc .Lvmlaunch
>   
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 24e8694b83fc..2d149589cf5b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7226,13 +7226,17 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>   
>   	guest_state_enter_irqoff();
>   
> -	/* L1D Flush includes CPU buffer clear to mitigate MDS */
> +	/*
> +	 * L1D Flush includes CPU buffer clear to mitigate MDS, but VERW
> +	 * mitigation for MDS is done late in VMentry and is still
> +	 * executed inspite of L1D Flush. This is because an extra VERW
> +	 * should not matter much after the big hammer L1D Flush.
> +	 */
>   	if (static_branch_unlikely(&vmx_l1d_should_flush))
>   		vmx_l1d_flush(vcpu);
> -	else if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
> -		mds_clear_cpu_buffers();
>   	else if (static_branch_unlikely(&mmio_stale_data_clear) &&
>   		 kvm_arch_has_assigned_device(vcpu->kvm))
> +		/* MMIO mitigation is mutually exclusive with MDS mitigation later in asm */

Mutually exclusive implies that you have one or the other but not both, 
whilst I think the right formulation here is redundant? Because if mmio 
is enabled  mds_clear_cpu_buffers() will clear the buffers here  and 
later they'll be cleared again, no ? Alternatively you might augment 
this check to only execute iff X86_FEATURE_CLEAR_CPU_BUF is not set?

>   		mds_clear_cpu_buffers();
>   
>   	vmx_disable_fb_clear(vmx);
> 


* Re: [PATCH v3 2/6] x86/entry_64: Add VERW just before userspace transition
  2023-10-25 20:52 ` [PATCH v3 2/6] x86/entry_64: Add VERW just before userspace transition Pawan Gupta
@ 2023-10-26 16:25   ` Nikolay Borisov
  2023-10-26 19:29     ` Pawan Gupta
  0 siblings, 1 reply; 32+ messages in thread
From: Nikolay Borisov @ 2023-10-26 16:25 UTC (permalink / raw)
  To: Pawan Gupta, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf,
	Andy Lutomirski, Jonathan Corbet, Sean Christopherson,
	Paolo Bonzini, tony.luck, ak, tim.c.chen
  Cc: linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias, Dave Hansen



On 25.10.23 г. 23:52 ч., Pawan Gupta wrote:

<snip>

> @@ -1520,6 +1530,7 @@ SYM_CODE_START(ignore_sysret)
>   	UNWIND_HINT_END_OF_STACK
>   	ENDBR
>   	mov	$-ENOSYS, %eax
> +	CLEAR_CPU_BUFFERS

nit: Just out of curiosity, is it really needed in this case or is it
done for the sake of uniformity so that all ring3 transitions are
indeed covered?

>   	sysretl
>   SYM_CODE_END(ignore_sysret)
>   #endif
> diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
> index 70150298f8bd..245697eb8485 100644
> --- a/arch/x86/entry/entry_64_compat.S
> +++ b/arch/x86/entry/entry_64_compat.S
> @@ -271,6 +271,7 @@ SYM_INNER_LABEL(entry_SYSRETL_compat_unsafe_stack, SYM_L_GLOBAL)
>   	xorl	%r9d, %r9d
>   	xorl	%r10d, %r10d
>   	swapgs
> +	CLEAR_CPU_BUFFERS
>   	sysretl
>   SYM_INNER_LABEL(entry_SYSRETL_compat_end, SYM_L_GLOBAL)
>   	ANNOTATE_NOENDBR
> 


* Re: [PATCH v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation
  2023-10-26 16:14   ` Nikolay Borisov
@ 2023-10-26 19:07     ` Pawan Gupta
  0 siblings, 0 replies; 32+ messages in thread
From: Pawan Gupta @ 2023-10-26 19:07 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen, linux-kernel, linux-doc, kvm, Alyssa Milburn,
	Daniel Sneddon, antonio.gomez.iglesias

On Thu, Oct 26, 2023 at 07:14:18PM +0300, Nikolay Borisov wrote:
> >   	if (static_branch_unlikely(&vmx_l1d_should_flush))
> >   		vmx_l1d_flush(vcpu);
> > -	else if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
> > -		mds_clear_cpu_buffers();
> >   	else if (static_branch_unlikely(&mmio_stale_data_clear) &&
> >   		 kvm_arch_has_assigned_device(vcpu->kvm))
> > +		/* MMIO mitigation is mutually exclusive with MDS mitigation later in asm */
> 
> Mutually exclusive implies that you have one or the other but not both,
> whilst I think the right formulation here is redundant? Because if mmio is
> enabled  mds_clear_cpu_buffers() will clear the buffers here  and later
> they'll be cleared again, no ?

No, because when mmio_stale_data_clear is enabled,
X86_FEATURE_CLEAR_CPU_BUF will not be set because of how mitigation is
selected in mmio_select_mitigation():

mmio_select_mitigation()
{
...
         /*
          * Enable CPU buffer clear mitigation for host and VMM if also affected
          * by MDS or TAA. Otherwise, enable mitigation for VMM only.
          */
         if (boot_cpu_has_bug(X86_BUG_MDS) || (boot_cpu_has_bug(X86_BUG_TAA) &&
                                               boot_cpu_has(X86_FEATURE_RTM)))
                 setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF);
         else
                 static_branch_enable(&mmio_stale_data_clear);

> Alternatively you might augment this check to only execute iff
> X86_FEATURE_CLEAR_CPU_BUF is not set?

It already is like that due to the logic above. That is what the
comment:

	/* MMIO mitigation is mutually exclusive with MDS mitigation later in asm */

... is trying to convey. Suggestions welcome to improve the comment.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/6] x86/entry_64: Add VERW just before userspace transition
  2023-10-26 16:25   ` Nikolay Borisov
@ 2023-10-26 19:29     ` Pawan Gupta
  2023-10-26 19:40       ` Dave Hansen
  0 siblings, 1 reply; 32+ messages in thread
From: Pawan Gupta @ 2023-10-26 19:29 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen, linux-kernel, linux-doc, kvm, Alyssa Milburn,
	Daniel Sneddon, antonio.gomez.iglesias, Dave Hansen

On Thu, Oct 26, 2023 at 07:25:27PM +0300, Nikolay Borisov wrote:
> 
> 
> On 25.10.23 г. 23:52 ч., Pawan Gupta wrote:
> 
> <snip>
> 
> > @@ -1520,6 +1530,7 @@ SYM_CODE_START(ignore_sysret)
> >   	UNWIND_HINT_END_OF_STACK
> >   	ENDBR
> >   	mov	$-ENOSYS, %eax
> > +	CLEAR_CPU_BUFFERS
> 
> nit: Just out of curiosity, is it really needed in this case, or is it done
> for the sake of uniformity so that all ring3 transitions are indeed
> covered?

Interrupts returning to kernel don't clear the CPU buffers. I believe
interrupts will be enabled here, and getting an interrupt here could
leak the data that interrupt touched.
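
For reference, CLEAR_CPU_BUFFERS itself is just a thin wrapper around
VERW; roughly (a sketch only, the exact form is in patch 1 of this
series):

	.macro CLEAR_CPU_BUFFERS
		/* No-op unless X86_FEATURE_CLEAR_CPU_BUF is set; clobbers EFLAGS.ZF */
		ALTERNATIVE "", __stringify(verw _ASM_RIP(mds_verw_sel)), X86_FEATURE_CLEAR_CPU_BUF
	.endm

so on unaffected CPUs the alternative patches in nothing and adding it
on this path costs nothing.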

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH  v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation
  2023-10-25 20:53 ` [PATCH v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation Pawan Gupta
  2023-10-26 16:14   ` Nikolay Borisov
@ 2023-10-26 19:30   ` Sean Christopherson
  2023-10-26 20:17     ` Sean Christopherson
  2023-10-26 20:48     ` Pawan Gupta
  1 sibling, 2 replies; 32+ messages in thread
From: Sean Christopherson @ 2023-10-26 19:30 UTC (permalink / raw)
  To: Pawan Gupta
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Paolo Bonzini, tony.luck, ak, tim.c.chen,
	linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias

On Wed, Oct 25, 2023, Pawan Gupta wrote:
> During VMentry VERW is executed to mitigate MDS. After VERW, any memory
> access like register push onto stack may put host data in MDS affected
> CPU buffers. A guest can then use MDS to sample host data.
> 
> Although the likelihood of secrets surviving in registers at the current
> VERW callsite is low, it can't be ruled out. Harden the MDS mitigation
> by moving the VERW mitigation late in the VMentry path.
> 
> Note that VERW for MMIO Stale Data mitigation is unchanged because of
> the complexity of per-guest conditional VERW which is not easy to handle
> that late in asm with no GPRs available. If the CPU is also affected by
> MDS, VERW is unconditionally executed late in asm regardless of guest
> having MMIO access.
> 
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> ---
>  arch/x86/kvm/vmx/vmenter.S |  3 +++
>  arch/x86/kvm/vmx/vmx.c     | 10 +++++++---
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> index b3b13ec04bac..139960deb736 100644
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -161,6 +161,9 @@ SYM_FUNC_START(__vmx_vcpu_run)
>  	/* Load guest RAX.  This kills the @regs pointer! */
>  	mov VCPU_RAX(%_ASM_AX), %_ASM_AX
>  
> +	/* Clobbers EFLAGS.ZF */
> +	CLEAR_CPU_BUFFERS
> +
>  	/* Check EFLAGS.CF from the VMX_RUN_VMRESUME bit test above. */
>  	jnc .Lvmlaunch
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 24e8694b83fc..2d149589cf5b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7226,13 +7226,17 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>  
>  	guest_state_enter_irqoff();
>  
> -	/* L1D Flush includes CPU buffer clear to mitigate MDS */
> +	/*
> +	 * L1D Flush includes CPU buffer clear to mitigate MDS, but VERW
> +	 * mitigation for MDS is done late in VMentry and is still
> +	 * executed inspite of L1D Flush. This is because an extra VERW

in spite

> +	 * should not matter much after the big hammer L1D Flush.
> +	 */
>  	if (static_branch_unlikely(&vmx_l1d_should_flush))
>  		vmx_l1d_flush(vcpu);

There's an existing bug here.  vmx_l1d_flush() is not guaranteed to do a flush in
"conditional mode", and is not guaranteed to do a ucode-based flush (though I can't
tell if it's possible for the VERW magic to exist without X86_FEATURE_FLUSH_L1D).

If we care, something like the diff at the bottom is probably needed.

> -	else if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
> -		mds_clear_cpu_buffers();
>  	else if (static_branch_unlikely(&mmio_stale_data_clear) &&
>  		 kvm_arch_has_assigned_device(vcpu->kvm))
> +		/* MMIO mitigation is mutually exclusive with MDS mitigation later in asm */

Please don't put comments inside an if/elif without curly braces (and I don't
want to add curly braces).  Though I think that's a moot point if we first fix
the conditional L1D flush issue.  E.g. when the dust settles we can end up with:

	/*
	 * Note, a ucode-based L1D flush also flushes CPU buffers, i.e. the
	 * manual VERW in __vmx_vcpu_run() to mitigate MDS *may* be redundant.
	 * But an L1D Flush is not guaranteed for "conditional mode", and the
	 * cost of an extra VERW after a full L1D flush is negligible.
	 */
	if (static_branch_unlikely(&vmx_l1d_should_flush))
		cpu_buffers_flushed = vmx_l1d_flush(vcpu);

	/*
	 * The MMIO stale data vulnerability is a subset of the general MDS
	 * vulnerability, i.e. this is mutually exclusive with the VERW that's
	 * done just before VM-Enter.  The vulnerability requires the attacker,
	 * i.e. the guest, to do MMIO, so this "clear" can be done earlier.
	 */
	if (static_branch_unlikely(&mmio_stale_data_clear) &&
	    !cpu_buffers_flushed && kvm_arch_has_assigned_device(vcpu->kvm))
		mds_clear_cpu_buffers();

>  		mds_clear_cpu_buffers();
>  
>  	vmx_disable_fb_clear(vmx);

LOL, nice.  IIUC, setting FB_CLEAR_DIS is mutually exclusive with doing a late
VERW, as KVM will never set FB_CLEAR_DIS if the CPU is susceptible to X86_BUG_MDS.
But the checks aren't identical, which makes this _look_ sketchy.

Can you do something like this to ensure we don't accidentally neuter the late VERW?

static void vmx_update_fb_clear_dis(struct kvm_vcpu *vcpu, struct vcpu_vmx *vmx)
{
	vmx->disable_fb_clear = (host_arch_capabilities & ARCH_CAP_FB_CLEAR_CTRL) &&
				!boot_cpu_has_bug(X86_BUG_MDS) &&
				!boot_cpu_has_bug(X86_BUG_TAA);

	if (vmx->disable_fb_clear &&
	    WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF)))
	    	vmx->disable_fb_clear = false;

	...
}

--
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6e502ba93141..cf6e06bb8310 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6606,8 +6606,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
  * is not exactly LRU. This could be sized at runtime via topology
  * information but as all relevant affected CPUs have 32KiB L1D cache size
  * there is no point in doing so.
+ *
+ * Returns %true if CPU buffers were cleared, i.e. if a microcode-based L1D
+ * flush was executed (which also clears CPU buffers).
  */
-static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
+static noinstr bool vmx_l1d_flush(struct kvm_vcpu *vcpu)
 {
        int size = PAGE_SIZE << L1D_CACHE_ORDER;
 
@@ -6634,14 +6637,14 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
                kvm_clear_cpu_l1tf_flush_l1d();
 
                if (!flush_l1d)
-                       return;
+                       return false;
        }
 
        vcpu->stat.l1d_flush++;
 
        if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
                native_wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
-               return;
+               return true;
        }
 
        asm volatile(
@@ -6665,6 +6668,8 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
                :: [flush_pages] "r" (vmx_l1d_flush_pages),
                    [size] "r" (size)
                : "eax", "ebx", "ecx", "edx");
+
+       return false;
 }
 
 static void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
@@ -7222,16 +7227,17 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
                                        unsigned int flags)
 {
        struct vcpu_vmx *vmx = to_vmx(vcpu);
+       bool cpu_buffers_flushed = false;
 
        guest_state_enter_irqoff();
 
-       /* L1D Flush includes CPU buffer clear to mitigate MDS */
        if (static_branch_unlikely(&vmx_l1d_should_flush))
-               vmx_l1d_flush(vcpu);
-       else if (static_branch_unlikely(&mds_user_clear))
-               mds_clear_cpu_buffers();
-       else if (static_branch_unlikely(&mmio_stale_data_clear) &&
-                kvm_arch_has_assigned_device(vcpu->kvm))
+               cpu_buffers_flushed = vmx_l1d_flush(vcpu);
+
+       if ((static_branch_unlikely(&mds_user_clear) ||
+            (static_branch_unlikely(&mmio_stale_data_clear) &&
+             kvm_arch_has_assigned_device(vcpu->kvm))) &&
+           !cpu_buffers_flushed)
                mds_clear_cpu_buffers();
 
        vmx_disable_fb_clear(vmx);


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/6] x86/entry_64: Add VERW just before userspace transition
  2023-10-26 19:29     ` Pawan Gupta
@ 2023-10-26 19:40       ` Dave Hansen
  2023-10-26 21:15         ` Pawan Gupta
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2023-10-26 19:40 UTC (permalink / raw)
  To: Pawan Gupta, Nikolay Borisov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen, linux-kernel, linux-doc, kvm, Alyssa Milburn,
	Daniel Sneddon, antonio.gomez.iglesias

On 10/26/23 12:29, Pawan Gupta wrote:
> On Thu, Oct 26, 2023 at 07:25:27PM +0300, Nikolay Borisov wrote:
>> On 25.10.23 г. 23:52 ч., Pawan Gupta wrote:
>>> @@ -1520,6 +1530,7 @@ SYM_CODE_START(ignore_sysret)
>>>   	UNWIND_HINT_END_OF_STACK
>>>   	ENDBR
>>>   	mov	$-ENOSYS, %eax
>>> +	CLEAR_CPU_BUFFERS
>> nit: Just out of curiosity, is it really needed in this case, or is it done
>> for the sake of uniformity so that all ring3 transitions are indeed
>> covered?
> Interrupts returning to kernel don't clear the CPU buffers. I believe
> interrupts will be enabled here, and getting an interrupt here could
> leak the data that interrupt touched.

Specifically NMIs, right?

X86_EFLAGS_IF should be clear here.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH  v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation
  2023-10-26 19:30   ` Sean Christopherson
@ 2023-10-26 20:17     ` Sean Christopherson
  2023-10-26 21:27       ` Pawan Gupta
  2023-10-26 20:48     ` Pawan Gupta
  1 sibling, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2023-10-26 20:17 UTC (permalink / raw)
  To: Pawan Gupta
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Paolo Bonzini, tony.luck, ak, tim.c.chen,
	linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias

On Thu, Oct 26, 2023, Sean Christopherson wrote:
> On Wed, Oct 25, 2023, Pawan Gupta wrote:
> >  	vmx_disable_fb_clear(vmx);
> 
> LOL, nice.  IIUC, setting FB_CLEAR_DIS is mutually exclusive with doing a late
> VERW, as KVM will never set FB_CLEAR_DIS if the CPU is susceptible to X86_BUG_MDS.
> But the checks aren't identical, which makes this _look_ sketchy.
> 
> Can you do something like this to ensure we don't accidentally neuter the late VERW?
> 
> static void vmx_update_fb_clear_dis(struct kvm_vcpu *vcpu, struct vcpu_vmx *vmx)
> {
> 	vmx->disable_fb_clear = (host_arch_capabilities & ARCH_CAP_FB_CLEAR_CTRL) &&
> 				!boot_cpu_has_bug(X86_BUG_MDS) &&
> 				!boot_cpu_has_bug(X86_BUG_TAA);
> 
> 	if (vmx->disable_fb_clear &&
> 	    WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF)))
> 	    	vmx->disable_fb_clear = false;
> 
> 	...
> }

Alternatively, and maybe even preferably, this would make it more obvious that
the two are mutually exclusive and would also be a (very, very) small perf win
when the mitigation is enabled.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0936516cb93b..592103df1754 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7236,7 +7236,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
                 kvm_arch_has_assigned_device(vcpu->kvm))
                mds_clear_cpu_buffers();
 
-       vmx_disable_fb_clear(vmx);
+       if (!cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
+               vmx_disable_fb_clear(vmx);
 
        if (vcpu->arch.cr2 != native_read_cr2())
                native_write_cr2(vcpu->arch.cr2);
@@ -7249,7 +7250,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 
        vmx->idt_vectoring_info = 0;
 
-       vmx_enable_fb_clear(vmx);
+       if (!cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
+               vmx_enable_fb_clear(vmx);
 
        if (unlikely(vmx->fail)) {
                vmx->exit_reason.full = 0xdead;

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH  v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation
  2023-10-26 19:30   ` Sean Christopherson
  2023-10-26 20:17     ` Sean Christopherson
@ 2023-10-26 20:48     ` Pawan Gupta
  2023-10-26 21:22       ` Sean Christopherson
  1 sibling, 1 reply; 32+ messages in thread
From: Pawan Gupta @ 2023-10-26 20:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Paolo Bonzini, tony.luck, ak, tim.c.chen,
	linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias

On Thu, Oct 26, 2023 at 12:30:55PM -0700, Sean Christopherson wrote:
> > -	/* L1D Flush includes CPU buffer clear to mitigate MDS */
> > +	/*
> > +	 * L1D Flush includes CPU buffer clear to mitigate MDS, but VERW
> > +	 * mitigation for MDS is done late in VMentry and is still
> > +	 * executed inspite of L1D Flush. This is because an extra VERW
> 
> in spite

Ok.

> > +	 * should not matter much after the big hammer L1D Flush.
> > +	 */
> >  	if (static_branch_unlikely(&vmx_l1d_should_flush))
> >  		vmx_l1d_flush(vcpu);
> 
> There's an existing bug here.  vmx_l1d_flush() is not guaranteed to do a flush in
> "conditional mode", and is not guaranteed to do a ucode-based flush

AFAICT, it is based on whether any sensitive data could have been
touched after a VMexit. If the L1TF mitigation doesn't consider certain
data sensitive and skips the L1D flush, executing VERW isn't giving any
protection, since that data can anyway be leaked from L1D using L1TF.
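
For context, the "conditional mode" gate in vmx_l1d_flush() is roughly
the following (paraphrased; see the actual function for details):

	if (static_branch_likely(&vmx_l1d_flush_cond)) {
		/* per-vCPU and per-CPU "L1D flush needed" hints */
		bool flush_l1d = vcpu->arch.l1tf_flush_l1d;

		vcpu->arch.l1tf_flush_l1d = false;
		flush_l1d |= kvm_get_cpu_l1tf_flush_l1d();
		kvm_clear_cpu_l1tf_flush_l1d();

		/* nothing suspect ran since the last flush -> skip */
		if (!flush_l1d)
			return;
	}
	/* ... ucode-based or software L1D flush follows ... */

i.e. the skip only happens when those hints say nothing sensitive was
touched since the last flush.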

> (though I can't tell if it's possible for the VERW magic to exist
> without X86_FEATURE_FLUSH_L1D).

Likely not; ucode that adds the VERW behavior should also have
X86_FEATURE_FLUSH_L1D, as L1TF was mitigated prior to MDS.

> If we care, something like the diff at the bottom is probably needed.
> 
> > -	else if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
> > -		mds_clear_cpu_buffers();
> >  	else if (static_branch_unlikely(&mmio_stale_data_clear) &&
> >  		 kvm_arch_has_assigned_device(vcpu->kvm))
> > +		/* MMIO mitigation is mutually exclusive with MDS mitigation later in asm */
> 
> Please don't put comments inside an if/elif without curly braces (and I don't
> want to add curly braces).  Though I think that's a moot point if we first fix
> the conditional L1D flush issue.  E.g. when the dust settles we can end up with:

Ok.

> 	/*
> 	 * Note, a ucode-based L1D flush also flushes CPU buffers, i.e. the
> 	 * manual VERW in __vmx_vcpu_run() to mitigate MDS *may* be redundant.
> 	 * But an L1D Flush is not guaranteed for "conditional mode", and the
> 	 * cost of an extra VERW after a full L1D flush is negligible.
> 	 */
> 	if (static_branch_unlikely(&vmx_l1d_should_flush))
> 		cpu_buffers_flushed = vmx_l1d_flush(vcpu);
> 
> 	/*
> 	 * The MMIO stale data vulnerability is a subset of the general MDS
> 	 * vulnerability, i.e. this is mutually exclusive with the VERW that's
> 	 * done just before VM-Enter.  The vulnerability requires the attacker,
> 	 * i.e. the guest, to do MMIO, so this "clear" can be done earlier.
> 	 */
> 	if (static_branch_unlikely(&mmio_stale_data_clear) &&
> 	    !cpu_buffers_flushed && kvm_arch_has_assigned_device(vcpu->kvm))
> 		mds_clear_cpu_buffers();

This is certainly better, but I don't know what scenario this is helping with.

> >  		mds_clear_cpu_buffers();
> >  
> >  	vmx_disable_fb_clear(vmx);
> 
> LOL, nice.  IIUC, setting FB_CLEAR_DIS is mutually exclusive with doing a late
> VERW, as KVM will never set FB_CLEAR_DIS if the CPU is susceptible to X86_BUG_MDS.
> But the checks aren't identical, which makes this _look_ sketchy.
> 
> Can you do something like this to ensure we don't accidentally neuter the late VERW?
> 
> static void vmx_update_fb_clear_dis(struct kvm_vcpu *vcpu, struct vcpu_vmx *vmx)
> {
> 	vmx->disable_fb_clear = (host_arch_capabilities & ARCH_CAP_FB_CLEAR_CTRL) &&
> 				!boot_cpu_has_bug(X86_BUG_MDS) &&
> 				!boot_cpu_has_bug(X86_BUG_TAA);
> 
> 	if (vmx->disable_fb_clear &&
> 	    WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF)))
> 	    	vmx->disable_fb_clear = false;

Will do, this makes a lot of sense.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/6] x86/entry_64: Add VERW just before userspace transition
  2023-10-26 19:40       ` Dave Hansen
@ 2023-10-26 21:15         ` Pawan Gupta
  2023-10-26 22:13           ` Pawan Gupta
  0 siblings, 1 reply; 32+ messages in thread
From: Pawan Gupta @ 2023-10-26 21:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Nikolay Borisov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf,
	Andy Lutomirski, Jonathan Corbet, Sean Christopherson,
	Paolo Bonzini, tony.luck, ak, tim.c.chen, linux-kernel,
	linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias

On Thu, Oct 26, 2023 at 12:40:49PM -0700, Dave Hansen wrote:
> On 10/26/23 12:29, Pawan Gupta wrote:
> > On Thu, Oct 26, 2023 at 07:25:27PM +0300, Nikolay Borisov wrote:
> >> On 25.10.23 г. 23:52 ч., Pawan Gupta wrote:
> >>> @@ -1520,6 +1530,7 @@ SYM_CODE_START(ignore_sysret)
> >>>   	UNWIND_HINT_END_OF_STACK
> >>>   	ENDBR
> >>>   	mov	$-ENOSYS, %eax
> >>> +	CLEAR_CPU_BUFFERS
> >> nit: Just out of curiosity, is it really needed in this case, or is it done
> >> for the sake of uniformity so that all ring3 transitions are indeed
> >> covered?
> > Interrupts returning to kernel don't clear the CPU buffers. I believe
> > interrupts will be enabled here, and getting an interrupt here could
> > leak the data that interrupt touched.
> 
> Specifically NMIs, right?

Yes, and VERW could be omitted for the same reason as for an NMI
returning to the kernel.

> X86_EFLAGS_IF should be clear here.

I see that SYSCALL has a configuration for IF, but I didn't see it for
SYSENTER in the code. Looking at the SDM, though, SYSENTER clears IF by
default.

syscall_init()
{
...
#else
	wrmsrl_cstar((unsigned long)ignore_sysret);
	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
#endif

	/*
	 * Flags to clear on syscall; clear as much as possible
	 * to minimize user space-kernel interference.
	 */
	wrmsrl(MSR_SYSCALL_MASK,
	       X86_EFLAGS_CF|X86_EFLAGS_PF|X86_EFLAGS_AF|
	       X86_EFLAGS_ZF|X86_EFLAGS_SF|X86_EFLAGS_TF|
	       X86_EFLAGS_IF|X86_EFLAGS_DF|X86_EFLAGS_OF|
	       X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_RF|
	       X86_EFLAGS_AC|X86_EFLAGS_ID);

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH  v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation
  2023-10-26 20:48     ` Pawan Gupta
@ 2023-10-26 21:22       ` Sean Christopherson
  2023-10-26 22:03         ` Pawan Gupta
  0 siblings, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2023-10-26 21:22 UTC (permalink / raw)
  To: Pawan Gupta
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Paolo Bonzini, tony.luck, ak, tim.c.chen,
	linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias

On Thu, Oct 26, 2023, Pawan Gupta wrote:
> On Thu, Oct 26, 2023 at 12:30:55PM -0700, Sean Christopherson wrote:
> > >  	if (static_branch_unlikely(&vmx_l1d_should_flush))
> > >  		vmx_l1d_flush(vcpu);
> > 
> > There's an existing bug here.  vmx_l1d_flush() is not guaranteed to do a flush in
> > "conditional mode", and is not guaranteed to do a ucode-based flush
> 
> AFAICT, it is based on whether any sensitive data could have been
> touched after a VMexit. If the L1TF mitigation doesn't consider certain
> data sensitive and skips the L1D flush, executing VERW isn't giving any
> protection, since that data can anyway be leaked from L1D using L1TF.

That assumes vcpu->arch.l1tf_flush_l1d is 100% precise and accurate, which is most
definitely not the case.  You're also preventing the admin from choosing between
being super paranoid (always flush L1D) and mostly paranoid (conditionally flush
L1D, always flush CPU buffers).

AIUI, flushing the L1D is crazy expensive compared to flushing the CPU buffers,
so it's entirely plausible for someone to want to choose the mostly paranoid
option.
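
(Concretely, that's the choice between the existing module parameter
settings, roughly:

	kvm-intel.vmentry_l1d_flush=always   # always flush L1D on VM-Enter
	kvm-intel.vmentry_l1d_flush=cond     # flush only after suspect paths

modulo the exact semantics of "cond".)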

Side topic, isn't the NMI path missing a call to kvm_set_cpu_l1tf_flush_l1d()?

> > 	/*
> > 	 * The MMIO stale data vulnerability is a subset of the general MDS
> > 	 * vulnerability, i.e. this is mutually exclusive with the VERW that's
> > 	 * done just before VM-Enter.  The vulnerability requires the attacker,
> > 	 * i.e. the guest, to do MMIO, so this "clear" can be done earlier.
> > 	 */
> > 	if (static_branch_unlikely(&mmio_stale_data_clear) &&
> > 	    !cpu_buffers_flushed && kvm_arch_has_assigned_device(vcpu->kvm))
> > 		mds_clear_cpu_buffers();
> 
> This is certainly better, but I don't know what scenario this is helping with.

Heh, that's how I feel about moving VERW to just before VM-Enter.  I have a hard
time believing there's meaningful sensitive data that's accessed in __vmx_vcpu_run().
The closest thing is probably CR2, but that's a very dubious vector since CR2 will
hold a guest value for most VM-Enters.

I'm not against moving VERW close to VM-Enter because it's relatively straightforward,
but if we're going to be super paranoid, why not go all the way and not have to
worry about what ifs?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH  v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation
  2023-10-26 20:17     ` Sean Christopherson
@ 2023-10-26 21:27       ` Pawan Gupta
  0 siblings, 0 replies; 32+ messages in thread
From: Pawan Gupta @ 2023-10-26 21:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Paolo Bonzini, tony.luck, ak, tim.c.chen,
	linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias

On Thu, Oct 26, 2023 at 01:17:47PM -0700, Sean Christopherson wrote:
> Alternatively, and maybe even preferably, this would make it more obvious that
> the two are mutually exclusive and would also be a (very, very) small perf win
> when the mitigation is enabled.

Agree.

> -       vmx_disable_fb_clear(vmx);
> +       if (!cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
> +               vmx_disable_fb_clear(vmx);


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH  v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation
  2023-10-26 21:22       ` Sean Christopherson
@ 2023-10-26 22:03         ` Pawan Gupta
  0 siblings, 0 replies; 32+ messages in thread
From: Pawan Gupta @ 2023-10-26 22:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Paolo Bonzini, tony.luck, ak, tim.c.chen,
	linux-kernel, linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias

On Thu, Oct 26, 2023 at 02:22:58PM -0700, Sean Christopherson wrote:
> On Thu, Oct 26, 2023, Pawan Gupta wrote:
> > On Thu, Oct 26, 2023 at 12:30:55PM -0700, Sean Christopherson wrote:
> > > >  	if (static_branch_unlikely(&vmx_l1d_should_flush))
> > > >  		vmx_l1d_flush(vcpu);
> > > 
> > > There's an existing bug here.  vmx_l1d_flush() is not guaranteed to do a flush in
> > > "conditional mode", and is not guaranteed to do a ucode-based flush
> > 
> > AFAICT, it is based on whether any sensitive data could have been
> > touched after a VMexit. If the L1TF mitigation doesn't consider certain
> > data sensitive and skips the L1D flush, executing VERW isn't giving any
> > protection, since that data can anyway be leaked from L1D using L1TF.
> 
> That assumes vcpu->arch.l1tf_flush_l1d is 100% precise and accurate, which is most
> definitely not the case.  You're also preventing the admin from choosing between
> being super paranoid (always flush L1D) and mostly paranoid (conditionally flush
> L1D, always flush CPU buffers).
> AIUI, flushing the L1D is crazy expensive compared to flushing the CPU buffers,
> so it's entirely plausible for someone to want to choose the mostly paranoid
> option.

Sure, if it helps an admin. I was asking about the problematic scenario
out of curiosity. BTW, the changes you suggested are definitely worth
doing.

> Side topic, isn't the NMI path missing a call to kvm_set_cpu_l1tf_flush_l1d()?

Yes, it is missing. Not sure if it was omitted intentionally.

> > This is certainly better, but I don't know what scenario this is helping with.
> 
> Heh, that's how I feel about moving VERW to just before VM-Enter.  I have a hard
> time believing there's meaningful sensitive data that's accessed in __vmx_vcpu_run().
> The closest thing is probably CR2, but that's a very dubious vector since CR2 will
> hold a guest value for most VM-Enters.

Yes, the kernel->user case has a better chance of leaking something.

> I'm not against moving VERW close to VM-Enter because it's relatively straightforward,
> but if we're going to be super paranoid, why not go all the way and not have to
> worry about what ifs?

Right. The VMenter changes are mostly done to be consistent with what is being
done for kernel->user.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/6] x86/entry_64: Add VERW just before userspace transition
  2023-10-26 21:15         ` Pawan Gupta
@ 2023-10-26 22:13           ` Pawan Gupta
  2023-10-26 22:17             ` Dave Hansen
  0 siblings, 1 reply; 32+ messages in thread
From: Pawan Gupta @ 2023-10-26 22:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Nikolay Borisov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf,
	Andy Lutomirski, Jonathan Corbet, Sean Christopherson,
	Paolo Bonzini, tony.luck, ak, tim.c.chen, linux-kernel,
	linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias

On Thu, Oct 26, 2023 at 02:15:11PM -0700, Pawan Gupta wrote:
> On Thu, Oct 26, 2023 at 12:40:49PM -0700, Dave Hansen wrote:
> > On 10/26/23 12:29, Pawan Gupta wrote:
> > > On Thu, Oct 26, 2023 at 07:25:27PM +0300, Nikolay Borisov wrote:
> > >> On 25.10.23 г. 23:52 ч., Pawan Gupta wrote:
> > >>> @@ -1520,6 +1530,7 @@ SYM_CODE_START(ignore_sysret)
> > >>>   	UNWIND_HINT_END_OF_STACK
> > >>>   	ENDBR
> > >>>   	mov	$-ENOSYS, %eax
> > >>> +	CLEAR_CPU_BUFFERS
> > >> nit: Just out of curiosity, is it really needed in this case, or is it done
> > >> for the sake of uniformity so that all ring3 transitions are indeed
> > >> covered?
> > > Interrupts returning to kernel don't clear the CPU buffers. I believe
> > > interrupts will be enabled here, and getting an interrupt here could
> > > leak the data that interrupt touched.
> > 
> > Specifically NMIs, right?
> 
> Yes, and VERW could be omitted for the same reason as for an NMI
> returning to the kernel.

Thinking more on this, we should not omit VERW here, as this spot is way
easier to target with NMIs. A user executing SYSENTER in a loop has a much
higher chance of causing an NMI to return to the kernel, and skip the VERW.
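
Roughly, the sequence of concern (illustrative only):

	user:   enters the kernel on this path, in a loop
	kernel: mov $-ENOSYS, %eax
	        <NMI hits here, touches kernel data, returns without VERW>
	kernel: sysretl   <- back to ring3; without CLEAR_CPU_BUFFERS the
	                     NMI's data can still be sitting in the buffers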

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/6] x86/entry_64: Add VERW just before userspace transition
  2023-10-26 22:13           ` Pawan Gupta
@ 2023-10-26 22:17             ` Dave Hansen
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Hansen @ 2023-10-26 22:17 UTC (permalink / raw)
  To: Pawan Gupta
  Cc: Nikolay Borisov, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf,
	Andy Lutomirski, Jonathan Corbet, Sean Christopherson,
	Paolo Bonzini, tony.luck, ak, tim.c.chen, linux-kernel,
	linux-doc, kvm, Alyssa Milburn, Daniel Sneddon,
	antonio.gomez.iglesias

On 10/26/23 15:13, Pawan Gupta wrote:
>>>> Interrupts returning to kernel don't clear the CPU buffers. I believe
>>>> interrupts will be enabled here, and getting an interrupt here could
>>>> leak the data that interrupt touched.
>>> Specifically NMIs, right?
>> Yes, and VERW could be omitted for the same reason as for an NMI
>> returning to the kernel.
> Thinking more on this, we should not omit VERW here, as this spot is way
> easier to target with NMIs. A user executing SYSENTER in a loop has a much
> higher chance of causing an NMI to return to the kernel, and skip the VERW.

Right.

This is also a path where we care *ZERO* about performance.  It's
basically all upside to _add_ VERW and all downside (increased attack
surface) to skip it.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW
  2023-10-25 22:13       ` Andrew Cooper
@ 2023-10-27 13:48         ` Pawan Gupta
  2023-10-27 14:12           ` Andrew Cooper
  0 siblings, 1 reply; 32+ messages in thread
From: Pawan Gupta @ 2023-10-27 13:48 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen, linux-kernel, linux-doc, kvm, Alyssa Milburn,
	Daniel Sneddon, antonio.gomez.iglesias, Alyssa Milburn

On Wed, Oct 25, 2023 at 11:13:46PM +0100, Andrew Cooper wrote:
> On 25/10/2023 11:07 pm, Pawan Gupta wrote:
> > On Wed, Oct 25, 2023 at 10:10:41PM +0100, Andrew Cooper wrote:
> >>> +.align L1_CACHE_BYTES, 0xcc
> >>> +SYM_CODE_START_NOALIGN(mds_verw_sel)
> >>> +	UNWIND_HINT_UNDEFINED
> >>> +	ANNOTATE_NOENDBR
> >>> +	.word __KERNEL_DS
> >> You need another .align here.  Otherwise subsequent code will still
> >> start in this cacheline and defeat the purpose of trying to keep it
> >> separate.
> > Right.
> >
> >>> +SYM_CODE_END(mds_verw_sel);
> >> Thinking about it, should this really be CODE and not a data entry?
> > Would that require adding a data equivalent of .entry.text and update
> > KPTI to keep it mapped? Or is there an easier option?
> 
Leave it right here in .entry.text, but try using SYM_DATA() and
> friends.  See whether objtool vomits over the result or not.

objtool still complains when using SYM_DATA*() without the annotations:

 vmlinux.o: warning: objtool: mds_verw_sel+0x0: unreachable instruction
 vmlinux.o: warning: objtool: .altinstr_replacement+0x2c: relocation to !ENDBR: mds_verw_sel+0x0

> And if objtool does vomit over the result, then leaving it as it is in
> this patch with SYM_CODE() is good enough.

Settling with SYM_CODE().
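
So the end result would be along these lines (a sketch; the exact
annotations/padding will be in v4):

	.pushsection .entry.text, "ax"

	.align L1_CACHE_BYTES, 0xcc
	SYM_CODE_START_NOALIGN(mds_verw_sel)
		UNWIND_HINT_UNDEFINED
		ANNOTATE_NOENDBR
		.word __KERNEL_DS
	/* pad the rest of the cacheline so no code shares it */
	.align L1_CACHE_BYTES, 0xcc
	SYM_CODE_END(mds_verw_sel);

	.popsection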

On the bright side, I am seeing even better perf with the VERW operand
out of line:

Baseline: v6.6-rc5

| Test               | Configuration          | v1   | v3   |
| ------------------ | ---------------------- | ---- | ---- |
| build-linux-kernel | defconfig              | 1.00 | 1.00 |
| hackbench          | 32 - Process           | 1.02 | 1.06 |
| nginx              | Short Connection - 500 | 1.01 | 1.04 |

Disclaimer: These are collected by a stupid dev who knows nothing about
perf, please take this with a grain of salt.

I will be sending v4 soon.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW
  2023-10-27 13:48         ` Pawan Gupta
@ 2023-10-27 14:12           ` Andrew Cooper
  2023-10-27 14:24             ` Pawan Gupta
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Cooper @ 2023-10-27 14:12 UTC (permalink / raw)
  To: Pawan Gupta
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen, linux-kernel, linux-doc, kvm, Alyssa Milburn,
	Daniel Sneddon, antonio.gomez.iglesias, Alyssa Milburn

On 27/10/2023 2:48 pm, Pawan Gupta wrote:
> On the bright side, I am seeing even better perf with the VERW operand
> out of line:
> 
> Baseline: v6.6-rc5
> 
> | Test               | Configuration          | v1   | v3   |
> | ------------------ | ---------------------- | ---- | ---- |
> | build-linux-kernel | defconfig              | 1.00 | 1.00 |
> | hackbench          | 32 - Process           | 1.02 | 1.06 |
> | nginx              | Short Connection - 500 | 1.01 | 1.04 |
> 
> Disclaimer: These are collected by a stupid dev who knows nothing about
> perf, please take this with a grain of salt.

:)

Almost as if it's a good idea to follow the advice of the Optimisation
Guide on mixing code and data, which is "don't".

~Andrew

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW
  2023-10-27 14:12           ` Andrew Cooper
@ 2023-10-27 14:24             ` Pawan Gupta
  0 siblings, 0 replies; 32+ messages in thread
From: Pawan Gupta @ 2023-10-27 14:24 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Josh Poimboeuf, Andy Lutomirski,
	Jonathan Corbet, Sean Christopherson, Paolo Bonzini, tony.luck,
	ak, tim.c.chen, linux-kernel, linux-doc, kvm, Alyssa Milburn,
	Daniel Sneddon, antonio.gomez.iglesias, Alyssa Milburn

On Fri, Oct 27, 2023 at 03:12:45PM +0100, Andrew Cooper wrote:
> Almost as if it's a good idea to follow the advice of the Optimisation
> Guide on mixing code and data, which is "don't".

Thanks a lot Andrew and Peter for shepherding me this way.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2023-10-27 14:24 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-25 20:52 [PATCH v3 0/6] Delay VERW Pawan Gupta
2023-10-25 20:52 ` [PATCH v3 1/6] x86/bugs: Add asm helpers for executing VERW Pawan Gupta
2023-10-25 21:10   ` Andrew Cooper
2023-10-25 21:28     ` Josh Poimboeuf
2023-10-25 21:30       ` Andrew Cooper
2023-10-25 21:49         ` Josh Poimboeuf
2023-10-25 22:07     ` Pawan Gupta
2023-10-25 22:13       ` Andrew Cooper
2023-10-27 13:48         ` Pawan Gupta
2023-10-27 14:12           ` Andrew Cooper
2023-10-27 14:24             ` Pawan Gupta
2023-10-26 13:44   ` Nikolay Borisov
2023-10-26 13:58     ` Andrew Cooper
2023-10-25 20:52 ` [PATCH v3 2/6] x86/entry_64: Add VERW just before userspace transition Pawan Gupta
2023-10-26 16:25   ` Nikolay Borisov
2023-10-26 19:29     ` Pawan Gupta
2023-10-26 19:40       ` Dave Hansen
2023-10-26 21:15         ` Pawan Gupta
2023-10-26 22:13           ` Pawan Gupta
2023-10-26 22:17             ` Dave Hansen
2023-10-25 20:53 ` [PATCH v3 3/6] x86/entry_32: " Pawan Gupta
2023-10-25 20:53 ` [PATCH v3 4/6] x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key Pawan Gupta
2023-10-25 20:53 ` [PATCH v3 5/6] KVM: VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH Pawan Gupta
2023-10-25 20:53 ` [PATCH v3 6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation Pawan Gupta
2023-10-26 16:14   ` Nikolay Borisov
2023-10-26 19:07     ` Pawan Gupta
2023-10-26 19:30   ` Sean Christopherson
2023-10-26 20:17     ` Sean Christopherson
2023-10-26 21:27       ` Pawan Gupta
2023-10-26 20:48     ` Pawan Gupta
2023-10-26 21:22       ` Sean Christopherson
2023-10-26 22:03         ` Pawan Gupta
