linux-kernel.vger.kernel.org archive mirror
* [RFC v1 00/26] Add TDX Guest Support
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 01/26] x86/paravirt: Introduce CONFIG_PARAVIRT_XL Kuppuswamy Sathyanarayanan
                   ` (30 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Hi All,

NOTE: This series is not ready for wide public review. It is being
specifically posted so that Peter Z and other experts on the entry
code can look for problems with the new exception handler (#VE).
That's also why x86@ is not being spammed.

Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. This series adds the bare-minimum
support to run a TDX guest. The host-side support will be submitted
separately. Support for advanced TD guest features like attestation
or debug mode will also be submitted separately. Note that, at this
point, the series is not secure: there are known holes in drivers,
and it has not yet been fully audited or fuzzed.

TDX has a lot of similarities to SEV. It enhances the confidentiality
of guest memory and state (like registers) and includes a new exception
(#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
yet), TDX limits the host's ability to effect changes in the guest
physical address space.

In contrast to the SEV code in the kernel, TDX guest memory is integrity
protected and isolated; the host is prevented from accessing guest
memory (even ciphertext).

The TDX architecture also includes a new CPU mode called
Secure-Arbitration Mode (SEAM). The software (TDX module) running in this
mode arbitrates interactions between host and guest and implements many of
the guarantees of the TDX architecture.

Some of the key differences between a TD and a regular VM are:

1. Multi-CPU bring-up is done using the ACPI MADT wake-up table.
2. A new #VE exception handler is added. The TDX module injects a #VE exception
   into the guest TD for instructions that need to be emulated, disallowed
   MSR accesses, a subset of CPUID leaves, etc.
3. By default, memory is marked as private, and the TD selectively shares it
   with the VMM as needed.
4. Remote attestation is supported, enabling a third party (either the owner of
   the workload or a user of the services it provides) to establish that the
   workload is running inside a TD on an Intel TDX-enabled platform before
   providing that workload with data.

TDX-related documents can be found at the following link:

https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html

This RFC series has been reviewed by Dave Hansen.

Kirill A. Shutemov (16):
  x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  x86/tdx: Get TD execution environment information via TDINFO
  x86/traps: Add #VE support for TDX guest
  x86/tdx: Add HLT support for TDX guest
  x86/tdx: Wire up KVM hypercalls
  x86/tdx: Add MSR support for TDX guest
  x86/tdx: Handle CPUID via #VE
  x86/io: Allow to override inX() and outX() implementation
  x86/tdx: Handle port I/O
  x86/tdx: Handle in-kernel MMIO
  x86/mm: Move force_dma_unencrypted() to common code
  x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  x86/tdx: Make pages shared in ioremap()
  x86/tdx: Add helper to do MapGPA TDVMCALL
  x86/tdx: Make DMA pages shared
  x86/kvm: Use bounce buffers for TD guest

Kuppuswamy Sathyanarayanan (6):
  x86/cpufeatures: Add TDX Guest CPU feature
  x86/cpufeatures: Add is_tdx_guest() interface
  x86/tdx: Handle MWAIT, MONITOR and WBINVD
  ACPI: tables: Add multiprocessor wake-up support
  x86/topology: Disable CPU hotplug support for TDX platforms.
  x86/tdx: Introduce INTEL_TDX_GUEST config option

Sean Christopherson (4):
  x86/boot: Add a trampoline for APs booting in 64-bit mode
  x86/boot: Avoid #VE during compressed boot for TDX platforms
  x86/boot: Avoid unnecessary #VE during boot process
  x86/tdx: Forcefully disable legacy PIC for TDX guests

 arch/x86/Kconfig                         |  28 +-
 arch/x86/boot/compressed/Makefile        |   2 +
 arch/x86/boot/compressed/head_64.S       |  10 +-
 arch/x86/boot/compressed/misc.h          |   1 +
 arch/x86/boot/compressed/pgtable.h       |   2 +-
 arch/x86/boot/compressed/tdx.c           |  32 ++
 arch/x86/boot/compressed/tdx_io.S        |   9 +
 arch/x86/include/asm/apic.h              |   3 +
 arch/x86/include/asm/asm-prototypes.h    |   1 +
 arch/x86/include/asm/cpufeatures.h       |   1 +
 arch/x86/include/asm/idtentry.h          |   4 +
 arch/x86/include/asm/io.h                |  25 +-
 arch/x86/include/asm/irqflags.h          |  42 +-
 arch/x86/include/asm/kvm_para.h          |  21 +
 arch/x86/include/asm/paravirt.h          |  22 +-
 arch/x86/include/asm/paravirt_types.h    |   3 +-
 arch/x86/include/asm/pgtable.h           |   3 +
 arch/x86/include/asm/realmode.h          |   1 +
 arch/x86/include/asm/tdx.h               | 114 +++++
 arch/x86/kernel/Makefile                 |   1 +
 arch/x86/kernel/acpi/boot.c              |  56 +++
 arch/x86/kernel/apic/probe_32.c          |   8 +
 arch/x86/kernel/apic/probe_64.c          |   8 +
 arch/x86/kernel/head64.c                 |   3 +
 arch/x86/kernel/head_64.S                |  13 +-
 arch/x86/kernel/idt.c                    |   6 +
 arch/x86/kernel/paravirt.c               |   4 +-
 arch/x86/kernel/pci-swiotlb.c            |   2 +-
 arch/x86/kernel/smpboot.c                |   5 +
 arch/x86/kernel/tdx-kvm.c                | 116 +++++
 arch/x86/kernel/tdx.c                    | 560 +++++++++++++++++++++++
 arch/x86/kernel/tdx_io.S                 | 143 ++++++
 arch/x86/kernel/topology.c               |   3 +-
 arch/x86/kernel/traps.c                  |  73 ++-
 arch/x86/mm/Makefile                     |   2 +
 arch/x86/mm/ioremap.c                    |   8 +-
 arch/x86/mm/mem_encrypt.c                |  74 ---
 arch/x86/mm/mem_encrypt_common.c         |  83 ++++
 arch/x86/mm/mem_encrypt_identity.c       |   1 +
 arch/x86/mm/pat/set_memory.c             |  23 +-
 arch/x86/realmode/rm/header.S            |   1 +
 arch/x86/realmode/rm/trampoline_64.S     |  49 +-
 arch/x86/realmode/rm/trampoline_common.S |   5 +-
 drivers/acpi/tables.c                    |   9 +
 include/acpi/actbl2.h                    |  21 +-
 45 files changed, 1444 insertions(+), 157 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdx.c
 create mode 100644 arch/x86/boot/compressed/tdx_io.S
 create mode 100644 arch/x86/include/asm/tdx.h
 create mode 100644 arch/x86/kernel/tdx-kvm.c
 create mode 100644 arch/x86/kernel/tdx.c
 create mode 100644 arch/x86/kernel/tdx_io.S
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC v1 01/26] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
  2021-02-05 23:38 ` [RFC v1 00/26] Add TDX Guest Support Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 02/26] x86/cpufeatures: Add TDX Guest CPU feature Kuppuswamy Sathyanarayanan
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Split off the halt paravirt calls from CONFIG_PARAVIRT_XXL into
a separate config option. It provides a middle ground for
not-so-deeply paravirtualized environments.

CONFIG_PARAVIRT_XL will be used by TDX, which needs a couple of the
paravirt calls hidden under CONFIG_PARAVIRT_XXL, while the rest of
that config option would be bloat for TDX.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/Kconfig                      |  4 +++
 arch/x86/boot/compressed/misc.h       |  1 +
 arch/x86/include/asm/irqflags.h       | 42 +++++++++++++++------------
 arch/x86/include/asm/paravirt.h       | 22 +++++++-------
 arch/x86/include/asm/paravirt_types.h |  3 +-
 arch/x86/kernel/paravirt.c            |  4 ++-
 arch/x86/mm/mem_encrypt_identity.c    |  1 +
 7 files changed, 46 insertions(+), 31 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7b6dd10b162a..8fe91114bfee 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -776,8 +776,12 @@ config PARAVIRT
 	  over full virtualization.  However, when run without a hypervisor
 	  the kernel is theoretically slower and slightly larger.
 
+config PARAVIRT_XL
+	bool
+
 config PARAVIRT_XXL
 	bool
+	select PARAVIRT_XL
 
 config PARAVIRT_DEBUG
 	bool "paravirt-ops debugging"
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 901ea5ebec22..4b84abe43765 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -9,6 +9,7 @@
  * paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
+#undef CONFIG_PARAVIRT_XL
 #undef CONFIG_PARAVIRT_XXL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 #undef CONFIG_KASAN
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 2dfc8d380dab..299c9b1ed857 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -68,11 +68,33 @@ static inline __cpuidle void native_halt(void)
 
 #endif
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_XL
 #include <asm/paravirt.h>
 #else
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
+/*
+ * Used in the idle loop; sti takes one instruction cycle
+ * to complete:
+ */
+static inline __cpuidle void arch_safe_halt(void)
+{
+	native_safe_halt();
+}
+
+/*
+ * Used when interrupts are already enabled or to
+ * shutdown the processor:
+ */
+static inline __cpuidle void halt(void)
+{
+	native_halt();
+}
+#endif /* !__ASSEMBLY__ */
+#endif /* CONFIG_PARAVIRT_XL */
+
+#ifndef CONFIG_PARAVIRT_XXL
+#ifndef __ASSEMBLY__
 
 static __always_inline unsigned long arch_local_save_flags(void)
 {
@@ -94,24 +116,6 @@ static __always_inline void arch_local_irq_enable(void)
 	native_irq_enable();
 }
 
-/*
- * Used in the idle loop; sti takes one instruction cycle
- * to complete:
- */
-static inline __cpuidle void arch_safe_halt(void)
-{
-	native_safe_halt();
-}
-
-/*
- * Used when interrupts are already enabled or to
- * shutdown the processor:
- */
-static inline __cpuidle void halt(void)
-{
-	native_halt();
-}
-
 /*
  * For spinlocks, etc:
  */
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index f8dce11d2bc1..700b94abfd1b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,18 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 	PVOP_VCALL1(mmu.exit_mmap, mm);
 }
 
+#ifdef CONFIG_PARAVIRT_XL
+static inline void arch_safe_halt(void)
+{
+	PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+	PVOP_VCALL0(irq.halt);
+}
+#endif
+
 #ifdef CONFIG_PARAVIRT_XXL
 static inline void load_sp0(unsigned long sp0)
 {
@@ -145,16 +157,6 @@ static inline void __write_cr4(unsigned long x)
 	PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-static inline void arch_safe_halt(void)
-{
-	PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
-	PVOP_VCALL0(irq.halt);
-}
-
 static inline void wbinvd(void)
 {
 	PVOP_VCALL0(cpu.wbinvd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index b6b02b7c19cc..634482a0a60d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -190,7 +190,8 @@ struct pv_irq_ops {
 	struct paravirt_callee_save restore_fl;
 	struct paravirt_callee_save irq_disable;
 	struct paravirt_callee_save irq_enable;
-
+#endif
+#ifdef CONFIG_PARAVIRT_XL
 	void (*safe_halt)(void);
 	void (*halt)(void);
 #endif
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 6c3407ba6ee9..85714a6389d6 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -327,9 +327,11 @@ struct paravirt_patch_template pv_ops = {
 	.irq.restore_fl		= __PV_IS_CALLEE_SAVE(native_restore_fl),
 	.irq.irq_disable	= __PV_IS_CALLEE_SAVE(native_irq_disable),
 	.irq.irq_enable		= __PV_IS_CALLEE_SAVE(native_irq_enable),
+#endif /* CONFIG_PARAVIRT_XXL */
+#ifdef CONFIG_PARAVIRT_XL
 	.irq.safe_halt		= native_safe_halt,
 	.irq.halt		= native_halt,
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_XL */
 
 	/* Mmu ops. */
 	.mmu.flush_tlb_user	= native_flush_tlb_local,
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 6c5eb6f3f14f..20d0cb116557 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -24,6 +24,7 @@
  * be extended when new paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
+#undef CONFIG_PARAVIRT_XL
 #undef CONFIG_PARAVIRT_XXL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 
-- 
2.25.1



* [RFC v1 02/26] x86/cpufeatures: Add TDX Guest CPU feature
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
  2021-02-05 23:38 ` [RFC v1 00/26] Add TDX Guest Support Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 01/26] x86/paravirt: Introduce CONFIG_PARAVIRT_XL Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface Kuppuswamy Sathyanarayanan
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Add CPU feature detection for Intel Trust Domain Extensions (TDX)
support. TDX adds the capability to keep guest register state and
memory isolated from the hypervisor.

On TDX guest platforms, executing CPUID(0x21, 0) returns the
following values in EAX, EBX, ECX and EDX:

EAX:  Maximum sub-leaf number:  0
EBX/EDX/ECX:  Vendor string:

EBX =  "Inte"
EDX =  "lTDX"
ECX =  "    "

When the above condition is met, set the X86_FEATURE_TDX_GUEST
feature cap bit.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/tdx.h         | 18 +++++++++++++++++
 arch/x86/kernel/Makefile           |  1 +
 arch/x86/kernel/head64.c           |  3 +++
 arch/x86/kernel/tdx.c              | 31 ++++++++++++++++++++++++++++++
 5 files changed, 54 insertions(+)
 create mode 100644 arch/x86/include/asm/tdx.h
 create mode 100644 arch/x86/kernel/tdx.c

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 84b887825f12..989e2b302880 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -238,6 +238,7 @@
 #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
 #define X86_FEATURE_SEV_ES		( 8*32+20) /* AMD Secure Encrypted Virtualization - Encrypted State */
 #define X86_FEATURE_VM_PAGE_FLUSH	( 8*32+21) /* "" VM Page Flush MSR is supported */
+#define X86_FEATURE_TDX_GUEST		( 8*32+22) /* Trusted Domain Extensions Guest */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
new file mode 100644
index 000000000000..2cc246c0cecf
--- /dev/null
+++ b/arch/x86/include/asm/tdx.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_TDX_H
+#define _ASM_X86_TDX_H
+
+#define TDX_CPUID_LEAF_ID	0x21
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+void __init tdx_early_init(void);
+
+#else // !CONFIG_INTEL_TDX_GUEST
+
+static inline void tdx_early_init(void) { };
+
+#endif /* CONFIG_INTEL_TDX_GUEST */
+
+#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 5eeb808eb024..ba8ee9300f23 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -128,6 +128,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 5e9beb77cafd..75f2401cb5db 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -40,6 +40,7 @@
 #include <asm/extable.h>
 #include <asm/trapnr.h>
 #include <asm/sev-es.h>
+#include <asm/tdx.h>
 
 /*
  * Manage page tables very early on.
@@ -491,6 +492,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	kasan_early_init();
 
+	tdx_early_init();
+
 	idt_setup_early_handler();
 
 	copy_bootdata(__va(real_mode_data));
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
new file mode 100644
index 000000000000..473b4c1c0920
--- /dev/null
+++ b/arch/x86/kernel/tdx.c
@@ -0,0 +1,31 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Copyright (C) 2020 Intel Corporation */
+
+#include <asm/tdx.h>
+#include <asm/cpufeature.h>
+
+static inline bool cpuid_has_tdx_guest(void)
+{
+	u32 eax, signature[3];
+
+	if (cpuid_eax(0) < TDX_CPUID_LEAF_ID)
+		return false;
+
+	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &signature[0],
+			&signature[1], &signature[2]);
+
+	if (memcmp("IntelTDX    ", signature, 12))
+		return false;
+
+	return true;
+}
+
+void __init tdx_early_init(void)
+{
+	if (!cpuid_has_tdx_guest())
+		return;
+
+	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+	pr_info("TDX guest is initialized\n");
+}
-- 
2.25.1



* [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (2 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 02/26] x86/cpufeatures: Add TDX Guest CPU feature Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-04-01 21:08   ` Dave Hansen
  2021-02-05 23:38 ` [RFC v1 04/26] x86/tdx: Get TD execution environment information via TDINFO Kuppuswamy Sathyanarayanan
                   ` (27 subsequent siblings)
  31 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan, Sean Christopherson

Add a helper function to detect TDX feature support. It will be used
to protect TDX-specific code.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile |  1 +
 arch/x86/boot/compressed/tdx.c    | 32 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/tdx.h        |  8 ++++++++
 arch/x86/kernel/tdx.c             |  6 ++++++
 4 files changed, 47 insertions(+)
 create mode 100644 arch/x86/boot/compressed/tdx.c

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index e0bc3988c3fa..a2554621cefe 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -96,6 +96,7 @@ ifdef CONFIG_X86_64
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
new file mode 100644
index 000000000000..0a87c1775b67
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tdx.c - Early boot code for TDX
+ */
+
+#include <asm/tdx.h>
+
+static int __ro_after_init tdx_guest = -1;
+
+static inline bool native_cpuid_has_tdx_guest(void)
+{
+	u32 eax = TDX_CPUID_LEAF_ID, signature[3] = {0};
+
+	if (native_cpuid_eax(0) < TDX_CPUID_LEAF_ID)
+		return false;
+
+	native_cpuid(&eax, &signature[0], &signature[1], &signature[2]);
+
+	if (memcmp("IntelTDX    ", signature, 12))
+		return false;
+
+	return true;
+}
+
+bool is_tdx_guest(void)
+{
+	if (tdx_guest < 0)
+		tdx_guest = native_cpuid_has_tdx_guest();
+
+	return !!tdx_guest;
+}
+
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 2cc246c0cecf..0b9d571b1f95 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -7,10 +7,18 @@
 
 #ifdef CONFIG_INTEL_TDX_GUEST
 
+/* Common API to check TDX support in decompression and common kernel code. */
+bool is_tdx_guest(void);
+
 void __init tdx_early_init(void);
 
 #else // !CONFIG_INTEL_TDX_GUEST
 
+static inline bool is_tdx_guest(void)
+{
+	return false;
+}
+
 static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 473b4c1c0920..e44e55d1e519 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -20,6 +20,12 @@ static inline bool cpuid_has_tdx_guest(void)
 	return true;
 }
 
+bool is_tdx_guest(void)
+{
+	return static_cpu_has(X86_FEATURE_TDX_GUEST);
+}
+EXPORT_SYMBOL_GPL(is_tdx_guest);
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
-- 
2.25.1



* [RFC v1 04/26] x86/tdx: Get TD execution environment information via TDINFO
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (3 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-08 10:00   ` Peter Zijlstra
  2021-02-05 23:38 ` [RFC v1 05/26] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
                   ` (26 subsequent siblings)
  31 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Per the Guest-Host-Communication Interface (GHCI) for Intel Trust
Domain Extensions (Intel TDX) specification, sec 2.4.2,
TDCALL[TDINFO] provides basic TD execution environment information
that is not available via CPUID.

Call TDINFO during early boot so the information can be used during
subsequent system initialization.

The call reports which bit in the pfn indicates that a page is shared
with the host, as well as attributes of the TD, such as debug mode.

Information about the number of CPUs is not saved, as there are no
users of it so far.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/tdx.h |  9 +++++++++
 arch/x86/kernel/tdx.c      | 27 +++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 0b9d571b1f95..f8cdc8eb1046 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -7,6 +7,15 @@
 
 #ifdef CONFIG_INTEL_TDX_GUEST
 
+/*
+ * TDCALL instruction is newly added in TDX architecture,
+ * used by TD for requesting the host VMM to provide
+ * (untrusted) services.
+ */
+#define TDCALL	".byte 0x66,0x0f,0x01,0xcc"
+
+#define TDINFO		1
+
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e44e55d1e519..13303bfdfdd1 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -3,6 +3,14 @@
 
 #include <asm/tdx.h>
 #include <asm/cpufeature.h>
+#include <linux/cpu.h>
+#include <asm/tdx.h>
+#include <asm/vmx.h>
+
+static struct {
+	unsigned int gpa_width;
+	unsigned long attributes;
+} td_info __ro_after_init;
 
 static inline bool cpuid_has_tdx_guest(void)
 {
@@ -26,6 +34,23 @@ bool is_tdx_guest(void)
 }
 EXPORT_SYMBOL_GPL(is_tdx_guest);
 
+static void tdx_get_info(void)
+{
+	register long rcx asm("rcx");
+	register long rdx asm("rdx");
+	register long r8 asm("r8");
+	long ret;
+
+	asm volatile(TDCALL
+		     : "=a"(ret), "=c"(rcx), "=r"(rdx), "=r"(r8)
+		     : "a"(TDINFO)
+		     : "r9", "r10", "r11", "memory");
+	BUG_ON(ret);
+
+	td_info.gpa_width = rcx & GENMASK(5, 0);
+	td_info.attributes = rdx;
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
@@ -33,5 +58,7 @@ void __init tdx_early_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+	tdx_get_info();
+
 	pr_info("TDX guest is initialized\n");
 }
-- 
2.25.1



* [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (4 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 04/26] x86/tdx: Get TD execution environment information via TDINFO Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-08 10:20   ` Peter Zijlstra
                     ` (2 more replies)
  2021-02-05 23:38 ` [RFC v1 06/26] x86/tdx: Add HLT " Kuppuswamy Sathyanarayanan
                   ` (25 subsequent siblings)
  31 siblings, 3 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The TDX module injects a #VE exception into the guest TD in cases of
disallowed instructions, disallowed MSR accesses and a subset of CPUID
leaves. It is also theoretically possible for the CPU to inject #VE on
an EPT violation, but the TDX module ensures this does not happen as
long as all memory in use has been properly accepted using TDCALLs.
More details can be found in the Guest-Host-Communication Interface
(GHCI) for Intel Trust Domain Extensions (Intel TDX) specification,
sec 2.3.

Add basic infrastructure to handle #VE. If there is no handler for a
given #VE, treat it as an unexpected event (fault case) and handle it
as a general protection fault via do_general_protection().

TDCALL[TDGETVEINFO] provides information about #VE such as exit reason.

More details on cases where #VE exceptions are and are not allowed:

The #VE exception does not occur in the paranoid entry paths, like
NMIs. While other operations during an NMI might cause #VE, these are
in the NMI code that can handle nesting, so there is no concern about
reentrancy. This is similar to how #PF is handled in NMIs.

The #VE exception also cannot happen in entry/exit code with the
wrong GS, such as the SWAPGS code, so its entry point does not need
"paranoid" handling.

Any memory access can cause #VE if it results in an EPT
violation.  However, the VMM is only in direct control of some of the
EPT tables.  The Secure EPT tables are controlled by the TDX module,
which guarantees that no EPT violation will result in a #VE for the
guest, once the memory has been accepted.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/idtentry.h |  4 ++
 arch/x86/include/asm/tdx.h      | 14 +++++++
 arch/x86/kernel/idt.c           |  6 +++
 arch/x86/kernel/tdx.c           | 31 ++++++++++++++
 arch/x86/kernel/traps.c         | 73 ++++++++++++++++++++++-----------
 5 files changed, 105 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 247a60a47331..a2cbb68f9ae8 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -615,6 +615,10 @@ DECLARE_IDTENTRY_VC(X86_TRAP_VC,	exc_vmm_communication);
 DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER,	exc_xen_hypervisor_callback);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE,		exc_virtualization_exception);
+#endif
+
 /* Device interrupts common/spurious */
 DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f8cdc8eb1046..90eb61b07d1f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -15,6 +15,7 @@
 #define TDCALL	".byte 0x66,0x0f,0x01,0xcc"
 
 #define TDINFO		1
+#define TDGETVEINFO	3
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
@@ -32,4 +33,17 @@ static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
+struct ve_info {
+	unsigned int exit_reason;
+	unsigned long exit_qual;
+	unsigned long gla;
+	unsigned long gpa;
+	unsigned int instr_len;
+	unsigned int instr_info;
+};
+
+unsigned long tdx_get_ve_info(struct ve_info *ve);
+int tdx_handle_virtualization_exception(struct pt_regs *regs,
+		struct ve_info *ve);
+
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..546b6b636c7d 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
 	 */
 	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 #endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 };
 
 /*
@@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
 	INTG(X86_TRAP_MF,		asm_exc_coprocessor_error),
 	INTG(X86_TRAP_AC,		asm_exc_alignment_check),
 	INTG(X86_TRAP_XF,		asm_exc_simd_coprocessor_error),
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 
 #ifdef CONFIG_X86_32
 	TSKG(X86_TRAP_DF,		GDT_ENTRY_DOUBLEFAULT_TSS),
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 13303bfdfdd1..ae2d5c847700 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -62,3 +62,34 @@ void __init tdx_early_init(void)
 
 	pr_info("TDX guest is initialized\n");
 }
+
+unsigned long tdx_get_ve_info(struct ve_info *ve)
+{
+	register long r8 asm("r8");
+	register long r9 asm("r9");
+	register long r10 asm("r10");
+	unsigned long ret;
+
+	asm volatile(TDCALL
+		     : "=a"(ret), "=c"(ve->exit_reason), "=d"(ve->exit_qual),
+		     "=r"(r8), "=r"(r9), "=r"(r10)
+		     : "a"(TDGETVEINFO)
+		     :);
+
+	ve->gla = r8;
+	ve->gpa = r9;
+	ve->instr_len = r10 & UINT_MAX;
+	ve->instr_info = r10 >> 32;
+	return ret;
+}
+
+int tdx_handle_virtualization_exception(struct pt_regs *regs,
+		struct ve_info *ve)
+{
+	/*
+	 * TODO: Add handler support for various #VE exit
+	 * reasons
+	 */
+	pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
+	return -EFAULT;
+}
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 7f5aec758f0e..ba98253b47cd 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/vdso.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
@@ -527,30 +528,14 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
 
 #define GPFSTR "general protection fault"
 
-DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
+static void do_general_protection(struct pt_regs *regs, long error_code)
 {
 	char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
 	enum kernel_gp_hint hint = GP_NO_HINT;
-	struct task_struct *tsk;
+	struct task_struct *tsk = current;
 	unsigned long gp_addr;
 	int ret;
 
-	cond_local_irq_enable(regs);
-
-	if (static_cpu_has(X86_FEATURE_UMIP)) {
-		if (user_mode(regs) && fixup_umip_exception(regs))
-			goto exit;
-	}
-
-	if (v8086_mode(regs)) {
-		local_irq_enable();
-		handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
-		local_irq_disable();
-		return;
-	}
-
-	tsk = current;
-
 	if (user_mode(regs)) {
 		tsk->thread.error_code = error_code;
 		tsk->thread.trap_nr = X86_TRAP_GP;
@@ -560,11 +545,11 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 
 		show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
 		force_sig(SIGSEGV);
-		goto exit;
+		return;
 	}
 
 	if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
-		goto exit;
+		return;
 
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_nr = X86_TRAP_GP;
@@ -576,11 +561,11 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	if (!preemptible() &&
 	    kprobe_running() &&
 	    kprobe_fault_handler(regs, X86_TRAP_GP))
-		goto exit;
+		return;
 
 	ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);
 	if (ret == NOTIFY_STOP)
-		goto exit;
+		return;
 
 	if (error_code)
 		snprintf(desc, sizeof(desc), "segment-related " GPFSTR);
@@ -601,8 +586,27 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 		gp_addr = 0;
 
 	die_addr(desc, regs, error_code, gp_addr);
+}
 
-exit:
+DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
+{
+	cond_local_irq_enable(regs);
+
+	if (static_cpu_has(X86_FEATURE_UMIP)) {
+		if (user_mode(regs) && fixup_umip_exception(regs)) {
+			cond_local_irq_disable(regs);
+			return;
+		}
+	}
+
+	if (v8086_mode(regs)) {
+		local_irq_enable();
+		handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
+		local_irq_disable();
+		return;
+	}
+
+	do_general_protection(regs, error_code);
 	cond_local_irq_disable(regs);
 }
 
@@ -1138,6 +1142,29 @@ DEFINE_IDTENTRY(exc_device_not_available)
 	}
 }
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+	struct ve_info ve;
+	int ret;
+
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
+	/* Consume #VE info before re-enabling interrupts */
+	ret = tdx_get_ve_info(&ve);
+	cond_local_irq_enable(regs);
+	if (!ret)
+		ret = tdx_handle_virtualization_exception(regs, &ve);
+	/*
+	 * If #VE exception handler could not handle it successfully, treat
+	 * it as #GP(0) and handle it.
+	 */
+	if (ret)
+		do_general_protection(regs, 0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
 #ifdef CONFIG_X86_32
 DEFINE_IDTENTRY_SW(iret_error)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [RFC v1 06/26] x86/tdx: Add HLT support for TDX guest
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (5 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 05/26] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 07/26] x86/tdx: Wire up KVM hypercalls Kuppuswamy Sathyanarayanan
                   ` (24 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Per Guest-Host-Communication Interface (GHCI) for Intel Trust
Domain Extensions (Intel TDX) specification, sec 3.8,
TDVMCALL[Instruction.HLT] provides HLT operation. Use it to implement
halt() and safe_halt() paravirtualization calls.

The same TDVMCALL is used to handle #VE exception due to
EXIT_REASON_HLT.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/tdx.h |  5 ++++
 arch/x86/kernel/tdx.c      | 61 ++++++++++++++++++++++++++++++++++----
 2 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 90eb61b07d1f..b98de067257b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -14,9 +14,14 @@
  */
 #define TDCALL	".byte 0x66,0x0f,0x01,0xcc"
 
+#define TDVMCALL	0
 #define TDINFO		1
 #define TDGETVEINFO	3
 
+/* TDVMCALL R10 Input */
+#define TDVMCALL_STANDARD	0
+#define TDVMCALL_VENDOR		1
+
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ae2d5c847700..25dd33bc2e49 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -51,6 +51,45 @@ static void tdx_get_info(void)
 	td_info.attributes = rdx;
 }
 
+static __cpuidle void tdx_halt(void)
+{
+	register long r10 asm("r10") = TDVMCALL_STANDARD;
+	register long r11 asm("r11") = EXIT_REASON_HLT;
+	register long rcx asm("rcx");
+	long ret;
+
+	/* Allow to pass R10 and R11 down to the VMM */
+	rcx = BIT(10) | BIT(11);
+
+	asm volatile(TDCALL
+			: "=a"(ret), "=r"(r10), "=r"(r11)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11)
+			: );
+
+	/* It should never fail */
+	BUG_ON(ret || r10);
+}
+
+static __cpuidle void tdx_safe_halt(void)
+{
+	register long r10 asm("r10") = TDVMCALL_STANDARD;
+	register long r11 asm("r11") = EXIT_REASON_HLT;
+	register long rcx asm("rcx");
+	long ret;
+
+	/* Allow to pass R10 and R11 down to the VMM */
+	rcx = BIT(10) | BIT(11);
+
+	/*
+	 * Enable interrupts right before TDCALL so that a pending
+	 * interrupt can bring the vCPU out of halt, matching the
+	 * native "sti; hlt" semantics.
+	 */
+	asm volatile("sti\n\t" TDCALL
+			: "=a"(ret), "=r"(r10), "=r"(r11)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11)
+			: );
+
+	/* It should never fail */
+	BUG_ON(ret || r10);
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
@@ -60,6 +99,9 @@ void __init tdx_early_init(void)
 
 	tdx_get_info();
 
+	pv_ops.irq.safe_halt = tdx_safe_halt;
+	pv_ops.irq.halt = tdx_halt;
+
 	pr_info("TDX guest is initialized\n");
 }
 
@@ -86,10 +128,17 @@ unsigned long tdx_get_ve_info(struct ve_info *ve)
 int tdx_handle_virtualization_exception(struct pt_regs *regs,
 		struct ve_info *ve)
 {
-	/*
-	 * TODO: Add handler support for various #VE exit
-	 * reasons
-	 */
-	pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
-	return -EFAULT;
+	switch (ve->exit_reason) {
+	case EXIT_REASON_HLT:
+		tdx_halt();
+		break;
+	default:
+		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
+		return -EFAULT;
+	}
+
+	/* After successful #VE handling, move the IP */
+	regs->ip += ve->instr_len;
+
+	return 0;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [RFC v1 07/26] x86/tdx: Wire up KVM hypercalls
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (6 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 06/26] x86/tdx: Add HLT " Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 08/26] x86/tdx: Add MSR support for TDX guest Kuppuswamy Sathyanarayanan
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

KVM hypercalls have to be wrapped into vendor-specific TDVMCALLs.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/kvm_para.h |  21 ++++++
 arch/x86/include/asm/tdx.h      |   8 +++
 arch/x86/kernel/tdx-kvm.c       | 116 ++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c           |   4 ++
 4 files changed, 149 insertions(+)
 create mode 100644 arch/x86/kernel/tdx-kvm.c

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
 #include <asm/alternative.h>
 #include <linux/interrupt.h>
 #include <uapi/asm/kvm_para.h>
+#include <asm/tdx.h>
 
 extern void kvmclock_init(void);
 
@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 static inline long kvm_hypercall0(unsigned int nr)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall0(nr);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall1(nr, p1);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
 				  unsigned long p2)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall2(nr, p1, p2);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
 				  unsigned long p2, unsigned long p3)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 				  unsigned long p4)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b98de067257b..8c3e5af88643 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -51,4 +51,12 @@ unsigned long tdx_get_ve_info(struct ve_info *ve);
 int tdx_handle_virtualization_exception(struct pt_regs *regs,
 		struct ve_info *ve);
 
+long tdx_kvm_hypercall0(unsigned int nr);
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1);
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2);
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3);
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4);
+
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx-kvm.c b/arch/x86/kernel/tdx-kvm.c
new file mode 100644
index 000000000000..323d43fcb338
--- /dev/null
+++ b/arch/x86/kernel/tdx-kvm.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+long tdx_kvm_hypercall0(unsigned int nr)
+{
+	register long r10 asm("r10") = TDVMCALL_VENDOR;
+	register long r11 asm("r11") = nr;
+	register long rcx asm("rcx");
+	long ret;
+
+	/* Allow to pass R10 and R11 down to the VMM */
+	rcx = BIT(10) | BIT(11);
+
+	asm volatile(TDCALL
+			: "=a"(ret), "=r"(r10)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11)
+			: "memory");
+
+	BUG_ON(ret);
+	return r10;
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall0);
+
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	register long r10 asm("r10") = TDVMCALL_VENDOR;
+	register long r11 asm("r11") = nr;
+	register long r12 asm("r12") = p1;
+	register long rcx asm("rcx");
+	long ret;
+
+	/* Allow to pass R10, R11 and R12 down to the VMM */
+	rcx = BIT(10) | BIT(11) | BIT(12);
+
+	asm volatile(TDCALL
+			: "=a"(ret), "=r"(r10)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12)
+			: "memory");
+
+	BUG_ON(ret);
+	return r10;
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall1);
+
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2)
+{
+	register long r10 asm("r10") = TDVMCALL_VENDOR;
+	register long r11 asm("r11") = nr;
+	register long r12 asm("r12") = p1;
+	register long r13 asm("r13") = p2;
+	register long rcx asm("rcx");
+	long ret;
+
+	/* Allow to pass R10, R11, R12 and R13 down to the VMM */
+	rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13);
+
+	asm volatile(TDCALL
+			: "=a"(ret), "=r"(r10)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+			  "r"(r13)
+			: "memory");
+
+	BUG_ON(ret);
+	return r10;
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall2);
+
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3)
+{
+	register long r10 asm("r10") = TDVMCALL_VENDOR;
+	register long r11 asm("r11") = nr;
+	register long r12 asm("r12") = p1;
+	register long r13 asm("r13") = p2;
+	register long r14 asm("r14") = p3;
+	register long rcx asm("rcx");
+	long ret;
+
+	/* Allow to pass R10, R11, R12, R13 and R14 down to the VMM */
+	rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14);
+
+	asm volatile(TDCALL
+			: "=a"(ret), "=r"(r10)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+			  "r"(r13), "r"(r14)
+			: "memory");
+
+	BUG_ON(ret);
+	return r10;
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall3);
+
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4)
+{
+	register long r10 asm("r10") = TDVMCALL_VENDOR;
+	register long r11 asm("r11") = nr;
+	register long r12 asm("r12") = p1;
+	register long r13 asm("r13") = p2;
+	register long r14 asm("r14") = p3;
+	register long r15 asm("r15") = p4;
+	register long rcx asm("rcx");
+	long ret;
+
+	/* Allow to pass R10, R11, R12, R13, R14 and R15 down to the VMM */
+	rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
+
+	asm volatile(TDCALL
+			: "=a"(ret), "=r"(r10)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+			  "r"(r13), "r"(r14), "r"(r15)
+			: "memory");
+
+	BUG_ON(ret);
+	return r10;
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 25dd33bc2e49..bbefe639a2ed 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,10 @@
 #include <asm/tdx.h>
 #include <asm/vmx.h>
 
+#ifdef CONFIG_KVM_GUEST
+#include "tdx-kvm.c"
+#endif
+
 static struct {
 	unsigned int gpa_width;
 	unsigned long attributes;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [RFC v1 08/26] x86/tdx: Add MSR support for TDX guest
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (7 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 07/26] x86/tdx: Wire up KVM hypercalls Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 09/26] x86/tdx: Handle CPUID via #VE Kuppuswamy Sathyanarayanan
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Operations on context-switched MSRs can be run natively. The rest of
the MSRs should be handled through TDVMCALLs.

TDVMCALL[Instruction.RDMSR] and TDVMCALL[Instruction.WRMSR] provide
the MSR operations.

You can find RDMSR and WRMSR details in the Guest-Host-Communication
Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX)
specification, sec 3.10 and 3.11.

Also, since the CSTAR MSR is not used for the SYSCALL instruction on
Intel CPUs, ignore accesses to it. Ignoring the MSR keeps callers
compatible: there is no need to wrap them in !is_tdx_guest() checks.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 94 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 93 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index bbefe639a2ed..5d961263601e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -94,6 +94,84 @@ static __cpuidle void tdx_safe_halt(void)
 	BUG_ON(ret || r10);
 }
 
+static bool tdx_is_context_switched_msr(unsigned int msr)
+{
+	/*  XXX: Update the list of context-switched MSRs */
+
+	switch (msr) {
+	case MSR_EFER:
+	case MSR_IA32_CR_PAT:
+	case MSR_FS_BASE:
+	case MSR_GS_BASE:
+	case MSR_KERNEL_GS_BASE:
+	case MSR_IA32_SYSENTER_CS:
+	case MSR_IA32_SYSENTER_EIP:
+	case MSR_IA32_SYSENTER_ESP:
+	case MSR_STAR:
+	case MSR_LSTAR:
+	case MSR_SYSCALL_MASK:
+	case MSR_IA32_XSS:
+	case MSR_TSC_AUX:
+	case MSR_IA32_BNDCFGS:
+		return true;
+	}
+	return false;
+}
+
+static u64 tdx_read_msr_safe(unsigned int msr, int *err)
+{
+	register long r10 asm("r10") = TDVMCALL_STANDARD;
+	register long r11 asm("r11") = EXIT_REASON_MSR_READ;
+	register long r12 asm("r12") = msr;
+	register long rcx asm("rcx");
+	long ret;
+
+	WARN_ON_ONCE(tdx_is_context_switched_msr(msr));
+
+	if (msr == MSR_CSTAR)
+		return 0;
+
+	/* Allow to pass R10, R11 and R12 down to the VMM */
+	rcx = BIT(10) | BIT(11) | BIT(12);
+
+	asm volatile(TDCALL
+			: "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12)
+			: );
+
+	/* XXX: Better error handling needed? */
+	*err = (ret || r10) ? -EIO : 0;
+
+	return r11;
+}
+
+static int tdx_write_msr_safe(unsigned int msr, unsigned int low,
+			      unsigned int high)
+{
+	register long r10 asm("r10") = TDVMCALL_STANDARD;
+	register long r11 asm("r11") = EXIT_REASON_MSR_WRITE;
+	register long r12 asm("r12") = msr;
+	register long r13 asm("r13") = (u64)high << 32 | low;
+	register long rcx asm("rcx");
+	long ret;
+
+	WARN_ON_ONCE(tdx_is_context_switched_msr(msr));
+
+	if (msr == MSR_CSTAR)
+		return 0;
+
+	/* Allow to pass R10, R11, R12 and R13 down to the VMM */
+	rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13);
+
+	asm volatile(TDCALL
+			: "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+			  "r"(r13)
+			: );
+
+	return ret || r10 ? -EIO : 0;
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
@@ -132,17 +210,31 @@ unsigned long tdx_get_ve_info(struct ve_info *ve)
 int tdx_handle_virtualization_exception(struct pt_regs *regs,
 		struct ve_info *ve)
 {
+	unsigned long val;
+	int ret = 0;
+
 	switch (ve->exit_reason) {
 	case EXIT_REASON_HLT:
 		tdx_halt();
 		break;
+	case EXIT_REASON_MSR_READ:
+		val = tdx_read_msr_safe(regs->cx, &ret);
+		if (!ret) {
+			regs->ax = val & UINT_MAX;
+			regs->dx = val >> 32;
+		}
+		break;
+	case EXIT_REASON_MSR_WRITE:
+		ret = tdx_write_msr_safe(regs->cx, regs->ax, regs->dx);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
 		return -EFAULT;
 	}
 
 	/* After successful #VE handling, move the IP */
-	regs->ip += ve->instr_len;
+	if (!ret)
+		regs->ip += ve->instr_len;
 
-	return 0;
+	return ret;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [RFC v1 09/26] x86/tdx: Handle CPUID via #VE
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (8 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 08/26] x86/tdx: Add MSR support for TDX guest Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:42   ` Andy Lutomirski
  2021-02-05 23:38 ` [RFC v1 10/26] x86/io: Allow to override inX() and outX() implementation Kuppuswamy Sathyanarayanan
                   ` (21 subsequent siblings)
  31 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

TDX has three classes of CPUID leaves: some CPUID leaves
are always handled by the CPU, others are handled by the TDX module,
and the rest are handled by the VMM. Since the VMM cannot directly
intercept the instruction, the VMM-handled leaves are reflected into
the guest with a #VE exception, which the guest then converts into a
TDVMCALL to the VMM or handles directly.

The TDX module EAS, section 16.2, has the full list of CPUID leaves
which are handled natively or by the TDX module. Only unknown CPUID
leaves are handled by the #VE method. In practice this typically only
applies to the hypervisor-specific CPUID leaves unknown to the
native CPU.

Therefore there is no risk of triggering this in early CPUID code,
which runs before the #VE handler is set up, because it will never
access those exotic CPUID leaves.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 5d961263601e..e98058c048b5 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -172,6 +172,35 @@ static int tdx_write_msr_safe(unsigned int msr, unsigned int low,
 	return ret || r10 ? -EIO : 0;
 }
 
+static void tdx_handle_cpuid(struct pt_regs *regs)
+{
+	register long r10 asm("r10") = TDVMCALL_STANDARD;
+	register long r11 asm("r11") = EXIT_REASON_CPUID;
+	register long r12 asm("r12") = regs->ax;
+	register long r13 asm("r13") = regs->cx;
+	register long r14 asm("r14");
+	register long r15 asm("r15");
+	register long rcx asm("rcx");
+	long ret;
+
+	/* Allow to pass R10, R11, R12, R13, R14 and R15 down to the VMM */
+	rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
+
+	asm volatile(TDCALL
+			: "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13),
+			  "=r"(r14), "=r"(r15)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+			  "r"(r13)
+			: );
+
+	regs->ax = r12;
+	regs->bx = r13;
+	regs->cx = r14;
+	regs->dx = r15;
+
+	WARN_ON(ret || r10);
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
@@ -227,6 +256,9 @@ int tdx_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_MSR_WRITE:
 		ret = tdx_write_msr_safe(regs->cx, regs->ax, regs->dx);
 		break;
+	case EXIT_REASON_CPUID:
+		tdx_handle_cpuid(regs);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [RFC v1 10/26] x86/io: Allow to override inX() and outX() implementation
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (9 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 09/26] x86/tdx: Handle CPUID via #VE Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 11/26] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Allow overriding the implementation of the port I/O
helpers. TDX code will provide an implementation that redirects the
helpers to paravirt calls.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/io.h | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d726459d08e5..ef7a686a55a9 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -271,18 +271,26 @@ static inline bool sev_key_active(void) { return false; }
 
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 
+#ifndef __out
+#define __out(bwl, bw)							\
+	asm volatile("out" #bwl " %" #bw "0, %w1" : : "a"(value), "Nd"(port))
+#endif
+
+#ifndef __in
+#define __in(bwl, bw)							\
+	asm volatile("in" #bwl " %w1, %" #bw "0" : "=a"(value) : "Nd"(port))
+#endif
+
 #define BUILDIO(bwl, bw, type)						\
 static inline void out##bwl(unsigned type value, int port)		\
 {									\
-	asm volatile("out" #bwl " %" #bw "0, %w1"			\
-		     : : "a"(value), "Nd"(port));			\
+	__out(bwl, bw);							\
 }									\
 									\
 static inline unsigned type in##bwl(int port)				\
 {									\
 	unsigned type value;						\
-	asm volatile("in" #bwl " %w1, %" #bw "0"			\
-		     : "=a"(value) : "Nd"(port));			\
+	__in(bwl, bw);							\
 	return value;							\
 }									\
 									\
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [RFC v1 11/26] x86/tdx: Handle port I/O
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (10 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 10/26] x86/io: Allow to override inX() and outX() implementation Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Unroll string operations and handle port I/O through TDVMCALLs.
Also handle #VE due to I/O operations with the same TDVMCALLs.

Decompression code uses port I/O for earlyprintk. We must use
paravirt calls there too if we want to allow earlyprintk.

Decompression code cannot deal with alternatives: use branches
instead to implement the inX() and outX() helpers.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile     |   1 +
 arch/x86/boot/compressed/tdx_io.S     |   9 ++
 arch/x86/include/asm/asm-prototypes.h |   1 +
 arch/x86/include/asm/io.h             |   5 +-
 arch/x86/include/asm/tdx.h            |  62 +++++++++--
 arch/x86/kernel/Makefile              |   2 +-
 arch/x86/kernel/tdx.c                 |  72 +++++++++++++
 arch/x86/kernel/tdx_io.S              | 143 ++++++++++++++++++++++++++
 8 files changed, 284 insertions(+), 11 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdx_io.S
 create mode 100644 arch/x86/kernel/tdx_io.S

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index a2554621cefe..54da333adc4e 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -97,6 +97,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx_io.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdx_io.S b/arch/x86/boot/compressed/tdx_io.S
new file mode 100644
index 000000000000..67498f67cb18
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx_io.S
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include <asm/export.h>
+
+/* Do not export symbols in decompression code */
+#undef EXPORT_SYMBOL
+#define EXPORT_SYMBOL(sym)
+
+#include "../../kernel/tdx_io.S"
diff --git a/arch/x86/include/asm/asm-prototypes.h b/arch/x86/include/asm/asm-prototypes.h
index 51e2bf27cc9b..6bc97aa39a21 100644
--- a/arch/x86/include/asm/asm-prototypes.h
+++ b/arch/x86/include/asm/asm-prototypes.h
@@ -6,6 +6,7 @@
 #include <asm/page.h>
 #include <asm/checksum.h>
 #include <asm/mce.h>
+#include <asm/tdx.h>
 
 #include <asm-generic/asm-prototypes.h>
 
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index ef7a686a55a9..30a3b30395ad 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -43,6 +43,7 @@
 #include <asm/page.h>
 #include <asm/early_ioremap.h>
 #include <asm/pgtable_types.h>
+#include <asm/tdx.h>
 
 #define build_mmio_read(name, size, type, reg, barrier) \
 static inline type name(const volatile void __iomem *addr) \
@@ -309,7 +310,7 @@ static inline unsigned type in##bwl##_p(int port)			\
 									\
 static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 {									\
-	if (sev_key_active()) {						\
+	if (sev_key_active() || is_tdx_guest()) {			\
 		unsigned type *value = (unsigned type *)addr;		\
 		while (count) {						\
 			out##bwl(*value, port);				\
@@ -325,7 +326,7 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 									\
 static inline void ins##bwl(int port, void *addr, unsigned long count)	\
 {									\
-	if (sev_key_active()) {						\
+	if (sev_key_active() || is_tdx_guest()) {			\
 		unsigned type *value = (unsigned type *)addr;		\
 		while (count) {						\
 			*value = in##bwl(port);				\
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8c3e5af88643..b46ae140e39b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -5,7 +5,16 @@
 
 #define TDX_CPUID_LEAF_ID	0x21
 
-#ifdef CONFIG_INTEL_TDX_GUEST
+#define TDVMCALL	0
+#define TDINFO		1
+#define TDGETVEINFO	3
+
+/* TDVMCALL R10 Input */
+#define TDVMCALL_STANDARD	0
+#define TDVMCALL_VENDOR		1
+
+#ifndef __ASSEMBLY__
+#include <asm/cpufeature.h>
 
 /*
  * TDCALL instruction is newly added in TDX architecture,
@@ -14,19 +23,55 @@
  */
 #define TDCALL	".byte 0x66,0x0f,0x01,0xcc"
 
-#define TDVMCALL	0
-#define TDINFO		1
-#define TDGETVEINFO	3
-
-/* TDVMCALL R10 Input */
-#define TDVMCALL_STANDARD	0
-#define TDVMCALL_VENDOR		1
+#ifdef CONFIG_INTEL_TDX_GUEST
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
 void __init tdx_early_init(void);
 
+/* Decompression code doesn't know how to handle alternatives */
+#ifdef BOOT_COMPRESSED_MISC_H
+#define __out(bwl, bw)							\
+do {									\
+	if (is_tdx_guest()) {						\
+		asm volatile("call tdx_out" #bwl : :			\
+				"a"(value), "d"(port));			\
+	} else {							\
+		asm volatile("out" #bwl " %" #bw "0, %w1" : :		\
+				"a"(value), "Nd"(port));		\
+	}								\
+} while (0)
+#define __in(bwl, bw)							\
+do {									\
+	if (is_tdx_guest()) {						\
+		asm volatile("call tdx_in" #bwl :			\
+				"=a"(value) : "d"(port));		\
+	} else {							\
+		asm volatile("in" #bwl " %w1, %" #bw "0" :		\
+				"=a"(value) : "Nd"(port));		\
+	}								\
+} while (0)
+#else
+#define __out(bwl, bw)							\
+	alternative_input("out" #bwl " %" #bw "1, %w2",			\
+			"call tdx_out" #bwl, X86_FEATURE_TDX_GUEST,	\
+			"a"(value), "d"(port))
+
+#define __in(bwl, bw)							\
+	alternative_io("in" #bwl " %w2, %" #bw "0",			\
+			"call tdx_in" #bwl, X86_FEATURE_TDX_GUEST,	\
+			"=a"(value), "d"(port))
+#endif
+
+void tdx_outb(unsigned char value, unsigned short port);
+void tdx_outw(unsigned short value, unsigned short port);
+void tdx_outl(unsigned int value, unsigned short port);
+
+unsigned char tdx_inb(unsigned short port);
+unsigned short tdx_inw(unsigned short port);
+unsigned int tdx_inl(unsigned short port);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
@@ -59,4 +104,5 @@ long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
 long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
 		unsigned long p3, unsigned long p4);
 
+#endif
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ba8ee9300f23..c1ec77df3213 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -128,7 +128,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o tdx_io.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e98058c048b5..3846d2807a7a 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -201,6 +201,75 @@ static void tdx_handle_cpuid(struct pt_regs *regs)
 	WARN_ON(ret || r10);
 }
 
+static void tdx_out(int size, unsigned int value, int port)
+{
+	register long r10 asm("r10") = TDVMCALL_STANDARD;
+	register long r11 asm("r11") = EXIT_REASON_IO_INSTRUCTION;
+	register long r12 asm("r12") = size;
+	register long r13 asm("r13") = 1;
+	register long r14 asm("r14") = port;
+	register long r15 asm("r15") = value;
+	register long rcx asm("rcx");
+	long ret;
+
+	/* Allow R10-R15 to be passed down to the VMM */
+	rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
+
+	asm volatile(TDCALL
+			: "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13),
+			  "=r"(r14), "=r"(r15)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+			  "r"(r13), "r"(r14), "r"(r15)
+			: );
+
+	WARN_ON(ret || r10);
+}
+
+static unsigned int tdx_in(int size, int port)
+{
+	register long r10 asm("r10") = TDVMCALL_STANDARD;
+	register long r11 asm("r11") = EXIT_REASON_IO_INSTRUCTION;
+	register long r12 asm("r12") = size;
+	register long r13 asm("r13") = 0;
+	register long r14 asm("r14") = port;
+	register long rcx asm("rcx");
+	long ret;
+
+	/* Allow R10-R14 to be passed down to the VMM */
+	rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14);
+
+	asm volatile(TDCALL
+			: "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13),
+			  "=r"(r14)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+			  "r"(r13), "r"(r14)
+			: );
+
+	WARN_ON(ret || r10);
+
+	return r11;
+}
+
+static void tdx_handle_io(struct pt_regs *regs, u32 exit_qual)
+{
+	bool string = exit_qual & 16;
+	int out, size, port;
+
+	/* I/O strings ops are unrolled at build time. */
+	BUG_ON(string);
+
+	out = (exit_qual & 8) ? 0 : 1;
+	size = (exit_qual & 7) + 1;
+	port = exit_qual >> 16;
+
+	if (out) {
+		tdx_out(size, regs->ax, port);
+	} else {
+		regs->ax &= ~GENMASK(8 * size - 1, 0);
+		regs->ax |= tdx_in(size, port) & GENMASK(8 * size - 1, 0);
+	}
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
@@ -259,6 +328,9 @@ int tdx_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_CPUID:
 		tdx_handle_cpuid(regs);
 		break;
+	case EXIT_REASON_IO_INSTRUCTION:
+		tdx_handle_io(regs, ve->exit_qual);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
 		return -EFAULT;
diff --git a/arch/x86/kernel/tdx_io.S b/arch/x86/kernel/tdx_io.S
new file mode 100644
index 000000000000..00ccbc9711fe
--- /dev/null
+++ b/arch/x86/kernel/tdx_io.S
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include <linux/linkage.h>
+#include <asm/export.h>
+
+#include <asm/tdx.h>
+
+#define TDCALL .byte 0x66,0x0f,0x01,0xcc
+
+#define EXIT_REASON_IO_INSTRUCTION 30
+
+SYM_FUNC_START(tdx_outb)
+	push	%r15
+	push	%r12
+
+	xor	%r15, %r15
+	mov	%al, %r15b
+	mov	$1, %r12
+	jmp	1f
+
+SYM_FUNC_START(tdx_outw)
+	push	%r15
+	push	%r12
+
+	xor	%r15, %r15
+	mov	%ax, %r15w
+	mov	$2, %r12
+	jmp	1f
+
+SYM_FUNC_START(tdx_outl)
+	push	%r15
+	push	%r12
+
+	xor	%r15, %r15
+	mov	%eax, %r15d
+	mov	$4, %r12
+1:
+	push	%rax
+	push	%rcx
+	push	%r10
+	push	%r11
+	push	%r13
+	push	%r14
+
+	mov	$TDVMCALL, %rax
+	mov	$TDVMCALL_STANDARD, %r10
+	mov	$EXIT_REASON_IO_INSTRUCTION, %r11
+	mov	$1, %r13
+	xor	%r14, %r14
+	mov	%dx, %r14w
+	/* Allow R10-R15 to be passed down to the VMM */
+	mov	$0xfc00, %rcx
+
+	TDCALL
+
+	/* Panic if TDCALL reports failure */
+	test	%rax, %rax
+	jnz	1f
+
+	/* Panic if TDVMCALL reports failure */
+	test	%r10, %r10
+	jnz	1f
+
+	pop	%r14
+	pop	%r13
+	pop	%r11
+	pop	%r10
+	pop	%rcx
+	pop	%rax
+
+	pop	%r12
+	pop	%r15
+	ret
+1:
+	ud2
+SYM_FUNC_END(tdx_outb)
+SYM_FUNC_END(tdx_outw)
+SYM_FUNC_END(tdx_outl)
+EXPORT_SYMBOL(tdx_outb)
+EXPORT_SYMBOL(tdx_outw)
+EXPORT_SYMBOL(tdx_outl)
+
+SYM_FUNC_START(tdx_inb)
+	push	%r12
+	mov	$1, %r12
+	jmp	1f
+
+SYM_FUNC_START(tdx_inw)
+	push	%r12
+	mov	$2, %r12
+	jmp	1f
+
+SYM_FUNC_START(tdx_inl)
+	push	%r12
+
+	mov	$4, %r12
+1:
+	push	%r11
+	push	%rax
+	push	%rcx
+	push	%r10
+	push	%r13
+	push	%r14
+
+	mov	$TDVMCALL, %rax
+	mov	$TDVMCALL_STANDARD, %r10
+	mov	$EXIT_REASON_IO_INSTRUCTION, %r11
+	mov	$0, %r13
+	xor	%r14, %r14
+	mov	%dx, %r14w
+
+	/* Allow R10-R14 to be passed down to the VMM */
+	mov	$0x7c00, %rcx
+
+	TDCALL
+
+	/* Panic if TDCALL reports failure */
+	test	%rax, %rax
+	jnz	1f
+
+	/* Panic if TDVMCALL reports failure */
+	test	%r10, %r10
+	jnz	1f
+
+	pop	%r14
+	pop	%r13
+	pop	%r10
+	pop	%rcx
+	pop	%rax
+
+	mov	%r11d, %eax
+
+	pop	%r11
+	pop	%r12
+	ret
+1:
+	ud2
+SYM_FUNC_END(tdx_inb)
+SYM_FUNC_END(tdx_inw)
+SYM_FUNC_END(tdx_inl)
+EXPORT_SYMBOL(tdx_inb)
+EXPORT_SYMBOL(tdx_inw)
+EXPORT_SYMBOL(tdx_inl)
-- 
2.25.1



* [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (11 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 11/26] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-04-01 19:56   ` Dave Hansen
  2021-02-05 23:38 ` [RFC v1 13/26] x86/tdx: Handle MWAIT, MONITOR and WBINVD Kuppuswamy Sathyanarayanan
                   ` (18 subsequent siblings)
  31 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Handle #VE exceptions due to MMIO operations. MMIO triggers a #VE
with the EPT_VIOLATION exit reason.

For now, only the subset of instructions that the kernel uses for
MMIO operations is handled. A user-space access triggers SIGBUS.
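
The two decode steps this patch adds can be sketched as plain C
helpers. The bit layout mirrors the handler in the patch (exit
qualification bit 1 = write; register index from ModRM.reg plus 8
when REX.R is set); the helper names are illustrative, not kernel
APIs:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative decode of the #VE EPT-violation exit qualification:
 * bit 1 set means the faulting access was a write.
 */
static bool mmio_is_write(unsigned int exit_qual)
{
	return exit_qual & 0x2;
}

/* Register number as decoded in get_reg_ptr(): the ModRM.reg field,
 * plus 8 when the REX.R prefix bit is set (hypothetical helper).
 */
static int mmio_reg_index(unsigned char modrm, bool rex_r)
{
	int regno = (modrm >> 3) & 0x7;

	if (rex_r)
		regno += 8;
	return regno;
}
```

The resulting index is then used to pick an offset into struct
pt_regs, as in the regoff[] table in the patch.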

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 120 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 120 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 3846d2807a7a..eff58329751e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -6,6 +6,8 @@
 #include <linux/cpu.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
+#include <asm/insn.h>
+#include <linux/sched/signal.h> /* force_sig_fault() */
 
 #ifdef CONFIG_KVM_GUEST
 #include "tdx-kvm.c"
@@ -270,6 +272,121 @@ static void tdx_handle_io(struct pt_regs *regs, u32 exit_qual)
 	}
 }
 
+static unsigned long tdx_mmio(int size, bool write, unsigned long addr,
+		unsigned long val)
+{
+	register long r10 asm("r10") = TDVMCALL_STANDARD;
+	register long r11 asm("r11") = EXIT_REASON_EPT_VIOLATION;
+	register long r12 asm("r12") = size;
+	register long r13 asm("r13") = write;
+	register long r14 asm("r14") = addr;
+	register long r15 asm("r15") = val;
+	register long rcx asm("rcx");
+	long ret;
+
+	/* Allow R10-R15 to be passed down to the VMM */
+	rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
+
+	asm volatile(TDCALL
+			: "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13),
+			  "=r"(r14), "=r"(r15)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+			  "r"(r13), "r"(r14), "r"(r15)
+			: );
+
+	WARN_ON(ret || r10);
+
+	return r11;
+}
+
+static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn)
+{
+	static const int regoff[] = {
+		offsetof(struct pt_regs, ax),
+		offsetof(struct pt_regs, cx),
+		offsetof(struct pt_regs, dx),
+		offsetof(struct pt_regs, bx),
+		offsetof(struct pt_regs, sp),
+		offsetof(struct pt_regs, bp),
+		offsetof(struct pt_regs, si),
+		offsetof(struct pt_regs, di),
+		offsetof(struct pt_regs, r8),
+		offsetof(struct pt_regs, r9),
+		offsetof(struct pt_regs, r10),
+		offsetof(struct pt_regs, r11),
+		offsetof(struct pt_regs, r12),
+		offsetof(struct pt_regs, r13),
+		offsetof(struct pt_regs, r14),
+		offsetof(struct pt_regs, r15),
+	};
+	int regno;
+
+	regno = X86_MODRM_REG(insn->modrm.value);
+	if (X86_REX_R(insn->rex_prefix.value))
+		regno += 8;
+
+	return (void *)regs + regoff[regno];
+}
+
+static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+	int size;
+	bool write;
+	unsigned long *reg;
+	struct insn insn;
+	unsigned long val = 0;
+
+	/*
+	 * User mode would mean the kernel exposed a device directly
+	 * to ring3, which shouldn't happen except for things like
+	 * DPDK.
+	 */
+	if (user_mode(regs)) {
+		pr_err("Unexpected user-mode MMIO access.\n");
+		force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
+		return 0;
+	}
+
+	kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
+	insn_get_length(&insn);
+	insn_get_opcode(&insn);
+
+	write = ve->exit_qual & 0x2;
+
+	size = insn.opnd_bytes;
+	switch (insn.opcode.bytes[0]) {
+	/* MOV r/m8	r8	*/
+	case 0x88:
+	/* MOV r8	r/m8	*/
+	case 0x8A:
+	/* MOV r/m8	imm8	*/
+	case 0xC6:
+		size = 1;
+		break;
+	}
+
+	if (inat_has_immediate(insn.attr)) {
+		BUG_ON(!write);
+		val = insn.immediate.value;
+		tdx_mmio(size, write, ve->gpa, val);
+		return insn.length;
+	}
+
+	BUG_ON(!inat_has_modrm(insn.attr));
+
+	reg = get_reg_ptr(regs, &insn);
+
+	if (write) {
+		memcpy(&val, reg, size);
+		tdx_mmio(size, write, ve->gpa, val);
+	} else {
+		val = tdx_mmio(size, write, ve->gpa, val);
+		memset(reg, 0, size);
+		memcpy(reg, &val, size);
+	}
+	return insn.length;
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
@@ -331,6 +448,9 @@ int tdx_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_IO_INSTRUCTION:
 		tdx_handle_io(regs, ve->exit_qual);
 		break;
+	case EXIT_REASON_EPT_VIOLATION:
+		ve->instr_len = tdx_handle_mmio(regs, ve);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1



* [RFC v1 13/26] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (12 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:43   ` Andy Lutomirski
  2021-02-05 23:38 ` [RFC v1 14/26] ACPI: tables: Add multiprocessor wake-up support Kuppuswamy Sathyanarayanan
                   ` (17 subsequent siblings)
  31 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

In non-root TDX guest mode, the MWAIT, MONITOR and WBINVD
instructions are not supported, so handle the #VE exceptions they
trigger as no-ops.
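
As a sketch, the classification this patch adds to the #VE handler
reduces to the predicate below (exit-reason values as defined in
asm/vmx.h; the function name is illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* VMX exit-reason values, as in asm/vmx.h */
#define EXIT_REASON_MWAIT_INSTRUCTION	36
#define EXIT_REASON_MONITOR_INSTRUCTION	39
#define EXIT_REASON_WBINVD		54

/* These #VEs have no corresponding TDVMCALL leaf, so the handler
 * only needs to skip the instruction (the caller advances RIP by
 * ve->instr_len).
 */
static bool ve_is_nop(unsigned int exit_reason)
{
	switch (exit_reason) {
	case EXIT_REASON_WBINVD:
	case EXIT_REASON_MWAIT_INSTRUCTION:
	case EXIT_REASON_MONITOR_INSTRUCTION:
		return true;
	default:
		return false;
	}
}
```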

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index eff58329751e..8d1d7555fb56 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -451,6 +451,23 @@ int tdx_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdx_handle_mmio(regs, ve);
 		break;
+	/*
+	 * Per Guest-Host-Communication Interface (GHCI) for Intel Trust
+	 * Domain Extensions (Intel TDX) specification, sec 2.4,
+	 * some instructions that unconditionally cause #VE (such as WBINVD,
+	 * MONITOR, MWAIT) do not have corresponding TDCALL
+	 * [TDG.VP.VMCALL <Instruction>] leaves, since the TD has been designed
+	 * with no deterministic way to confirm the result of those operations
+	 * performed by the host VMM.  In those cases, the goal is for the TD
+	 * #VE handler to increment the RIP appropriately based on the VE
+	 * information provided via TDCALL.
+	 */
+	case EXIT_REASON_WBINVD:
+		pr_warn_once("WBINVD #VE Exception\n");
+		fallthrough;
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+		/* Handle as nops. */
+		break;
 	default:
 		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1



* [RFC v1 14/26] ACPI: tables: Add multiprocessor wake-up support
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (13 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 13/26] x86/tdx: Handle MWAIT, MONITOR and WBINVD Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 15/26] x86/boot: Add a trampoline for APs booting in 64-bit mode Kuppuswamy Sathyanarayanan
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan, Sean Christopherson

As per the Guest-Host Communication Interface (GHCI) specification
for Intel TDX, sec 4.1, a new substructure - the multiprocessor
wake-up structure - is added to the ACPI Multiple APIC Description
Table (MADT) to describe the mailbox. If the platform firmware
produces the multiprocessor wake-up structure, the BSP in the OS may
use this new mailbox-based mechanism to wake up the APs.

Add parsing support for the MADT wake table and, if the table is
present, update apic->wakeup_secondary_cpu with a new API that uses
the MADT wake mailbox to wake up CPUs.
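
The BSP-side handshake described above can be sketched in
user-space C. The struct layout matches the mailbox this patch adds
to actbl2.h; the helper name is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Layout of the MADT multiprocessor wake-up mailbox, per GHCI
 * sec 4.1 (mirrors struct acpi_madt_mp_wake_mailbox in the patch).
 */
struct mp_wake_mailbox {
	uint16_t command;
	uint16_t flags;
	uint32_t apic_id;
	uint64_t wakeup_vector;
};

#define MP_WAKE_COMMAND_WAKEUP	1

/* BSP-side wake sequence: fill in the target APIC ID and the entry
 * point first, then write the command last so the polling AP only
 * sees a complete request (the kernel code uses WRITE_ONCE for the
 * same ordering).
 */
static void wake_ap(volatile struct mp_wake_mailbox *mb,
		    uint32_t apicid, uint64_t start_ip)
{
	mb->apic_id = apicid;
	mb->wakeup_vector = start_ip;
	mb->command = MP_WAKE_COMMAND_WAKEUP;
}
```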

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/apic.h     |  3 ++
 arch/x86/kernel/acpi/boot.c     | 56 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/apic/probe_32.c |  8 +++++
 arch/x86/kernel/apic/probe_64.c |  8 +++++
 drivers/acpi/tables.c           |  9 ++++++
 include/acpi/actbl2.h           | 21 ++++++++++++-
 6 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 34cb3c159481..63f970c61cbe 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -497,6 +497,9 @@ static inline unsigned int read_apic_id(void)
 	return apic->get_apic_id(reg);
 }
 
+typedef int (*wakeup_cpu_handler)(int apicid, unsigned long start_eip);
+extern void acpi_wake_cpu_handler_update(wakeup_cpu_handler handler);
+
 extern int default_apic_id_valid(u32 apicid);
 extern int default_acpi_madt_oem_check(char *, char *);
 extern void default_setup_apic_routing(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 7bdc0239a943..37ada1908fb7 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -65,6 +65,9 @@ int acpi_fix_pin2_polarity __initdata;
 static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
 #endif
 
+static struct acpi_madt_mp_wake_mailbox *acpi_mp_wake_mailbox;
+static u64 acpi_mp_wake_mailbox_paddr;
+
 #ifdef CONFIG_X86_IO_APIC
 /*
  * Locks related to IOAPIC hotplug
@@ -329,6 +332,29 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
 	return 0;
 }
 
+static void acpi_mp_wake_mailbox_init(void)
+{
+	if (acpi_mp_wake_mailbox)
+		return;
+
+	acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+			sizeof(*acpi_mp_wake_mailbox), MEMREMAP_WB);
+}
+
+static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
+{
+	acpi_mp_wake_mailbox_init();
+
+	if (!acpi_mp_wake_mailbox)
+		return -EINVAL;
+
+	WRITE_ONCE(acpi_mp_wake_mailbox->apic_id, apicid);
+	WRITE_ONCE(acpi_mp_wake_mailbox->wakeup_vector, start_ip);
+	WRITE_ONCE(acpi_mp_wake_mailbox->command, ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+	return 0;
+}
+
 #endif				/*CONFIG_X86_LOCAL_APIC */
 
 #ifdef CONFIG_X86_IO_APIC
@@ -1086,6 +1112,30 @@ static int __init acpi_parse_madt_lapic_entries(void)
 	}
 	return 0;
 }
+
+static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+				      const unsigned long end)
+{
+	struct acpi_madt_mp_wake *mp_wake = NULL;
+
+	if (!IS_ENABLED(CONFIG_SMP))
+		return -ENODEV;
+
+	mp_wake = (struct acpi_madt_mp_wake *)header;
+	if (BAD_MADT_ENTRY(mp_wake, end))
+		return -EINVAL;
+
+	if (acpi_mp_wake_mailbox)
+		return -EINVAL;
+
+	acpi_table_print_madt_entry(&header->common);
+
+	acpi_mp_wake_mailbox_paddr = mp_wake->mailbox_address;
+
+	acpi_wake_cpu_handler_update(acpi_wakeup_cpu);
+
+	return 0;
+}
 #endif				/* CONFIG_X86_LOCAL_APIC */
 
 #ifdef	CONFIG_X86_IO_APIC
@@ -1284,6 +1334,12 @@ static void __init acpi_process_madt(void)
 
 				smp_found_config = 1;
 			}
+
+			/*
+			 * Parse MADT MP Wake entry.
+			 */
+			acpi_table_parse_madt(ACPI_MADT_TYPE_MP_WAKE,
+					      acpi_parse_mp_wake, 1);
 		}
 		if (error == -EINVAL) {
 			/*
diff --git a/arch/x86/kernel/apic/probe_32.c b/arch/x86/kernel/apic/probe_32.c
index a61f642b1b90..d450014841b2 100644
--- a/arch/x86/kernel/apic/probe_32.c
+++ b/arch/x86/kernel/apic/probe_32.c
@@ -207,3 +207,11 @@ int __init default_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
 	}
 	return 0;
 }
+
+void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler)
+{
+	struct apic **drv;
+
+	for (drv = __apicdrivers; drv < __apicdrivers_end; drv++)
+		(*drv)->wakeup_secondary_cpu = handler;
+}
diff --git a/arch/x86/kernel/apic/probe_64.c b/arch/x86/kernel/apic/probe_64.c
index c46720f185c0..986dbb68d3c4 100644
--- a/arch/x86/kernel/apic/probe_64.c
+++ b/arch/x86/kernel/apic/probe_64.c
@@ -50,3 +50,11 @@ int __init default_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
 	}
 	return 0;
 }
+
+void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler)
+{
+	struct apic **drv;
+
+	for (drv = __apicdrivers; drv < __apicdrivers_end; drv++)
+		(*drv)->wakeup_secondary_cpu = handler;
+}
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index e48690a006a4..5e38748c5db1 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -207,6 +207,15 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
 		}
 		break;
 
+	case ACPI_MADT_TYPE_MP_WAKE:
+		{
+			struct acpi_madt_mp_wake *p =
+				(struct acpi_madt_mp_wake *)header;
+			pr_debug("MP Wake (version[%d] mailbox_address[%llx])\n",
+				 p->version, p->mailbox_address);
+		}
+		break;
+
 	default:
 		pr_warn("Found unsupported MADT entry (type = 0x%x)\n",
 			header->type);
diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index ec66779cb193..be953b638499 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -517,7 +517,8 @@ enum acpi_madt_type {
 	ACPI_MADT_TYPE_GENERIC_MSI_FRAME = 13,
 	ACPI_MADT_TYPE_GENERIC_REDISTRIBUTOR = 14,
 	ACPI_MADT_TYPE_GENERIC_TRANSLATOR = 15,
-	ACPI_MADT_TYPE_RESERVED = 16	/* 16 and greater are reserved */
+	ACPI_MADT_TYPE_MP_WAKE = 16,
+	ACPI_MADT_TYPE_RESERVED = 17	/* 17 and greater are reserved */
 };
 
 /*
@@ -724,6 +725,24 @@ struct acpi_madt_generic_translator {
 	u32 reserved2;
 };
 
+/* 16: MP Wake (ACPI 6.?) */
+
+struct acpi_madt_mp_wake {
+	struct acpi_subtable_header header;
+	u16 version;
+	u32 reserved2;
+	u64 mailbox_address;
+};
+
+struct acpi_madt_mp_wake_mailbox {
+	u16 command;
+	u16 flags;
+	u32 apic_id;
+	u64 wakeup_vector;
+};
+
+#define ACPI_MP_WAKE_COMMAND_WAKEUP	1
+
 /*
  * Common flags fields for MADT subtables
  */
-- 
2.25.1



* [RFC v1 15/26] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (14 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 14/26] ACPI: tables: Add multiprocessor wake-up support Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 16/26] x86/boot: Avoid #VE during compressed boot for TDX platforms Kuppuswamy Sathyanarayanan
                   ` (15 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kai Huang, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a trampoline for booting APs in 64-bit mode via a software handoff
with BIOS, and use the new trampoline for the ACPI MP wake protocol used
by TDX.

Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
mode.  For the GDT pointer, create a new entry as the existing storage
for the pointer occupies the zero entry in the GDT itself.
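
For reference, the four-byte extension follows from the
descriptor-pointer operand LIDT expects in each mode: a 16-bit limit
followed by a 32-bit base (6 bytes) in 32-bit mode, versus a 64-bit
base (10 bytes) in 64-bit mode. A sketch (struct names are
illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Descriptor-table pointer as consumed by LIDT/LGDT in 32-bit mode:
 * 16-bit limit plus 32-bit base, 6 bytes total.
 */
struct __attribute__((packed)) desc_ptr32 {
	uint16_t limit;
	uint32_t base;
};

/* In 64-bit mode the base widens to 8 bytes, 10 bytes total - hence
 * the extra four bytes reserved for tr_idt in trampoline_common.S.
 */
struct __attribute__((packed)) desc_ptr64 {
	uint16_t limit;
	uint64_t base;
};
```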

Reported-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/realmode.h          |  1 +
 arch/x86/kernel/smpboot.c                |  5 +++
 arch/x86/realmode/rm/header.S            |  1 +
 arch/x86/realmode/rm/trampoline_64.S     | 49 +++++++++++++++++++++++-
 arch/x86/realmode/rm/trampoline_common.S |  5 ++-
 5 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 5db5d083c873..5066c8b35e7c 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
 	u32	sev_es_trampoline_start;
 #endif
 #ifdef CONFIG_X86_64
+	u32	trampoline_start64;
 	u32	trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8ca66af96a54..11dd0deb4810 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1035,6 +1035,11 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
 	unsigned long boot_error = 0;
 	unsigned long timeout;
 
+#ifdef CONFIG_X86_64
+	if (is_tdx_guest())
+		start_ip = real_mode_header->trampoline_start64;
+#endif
+
 	idle->thread.sp = (unsigned long)task_pt_regs(idle);
 	early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
 	initial_code = (unsigned long)start_secondary;
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
 	.long	pa_sev_es_trampoline_start
 #endif
 #ifdef CONFIG_X86_64
+	.long	pa_trampoline_start64
 	.long	pa_trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 84c5d1b33d10..12b734b1da8b 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,20 @@ SYM_CODE_START(startup_32)
 	movl	%eax, %cr3
 
 	# Set up EFER
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	cmp	pa_tr_efer, %eax
+	jne	.Lwrite_efer
+	cmp	pa_tr_efer + 4, %edx
+	je	.Ldone_efer
+.Lwrite_efer:
 	movl	pa_tr_efer, %eax
 	movl	pa_tr_efer + 4, %edx
-	movl	$MSR_EFER, %ecx
 	wrmsr
 
+.Ldone_efer:
 	# Enable paging and in turn activate Long Mode
-	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	/*
@@ -161,6 +168,19 @@ SYM_CODE_START(startup_32)
 	ljmpl	$__KERNEL_CS, $pa_startup_64
 SYM_CODE_END(startup_32)
 
+SYM_CODE_START(pa_trampoline_compat)
+	/*
+	 * In compatibility mode.  Prep ESP and DX for startup_32, then disable
+	 * paging and complete the switch to legacy 32-bit mode.
+	 */
+	movl	$rm_stack_end, %esp
+	movw	$__KERNEL_DS, %dx
+
+	movl	$(X86_CR0_NE | X86_CR0_PE), %eax
+	movl	%eax, %cr0
+	ljmpl   $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
 	.section ".text64","ax"
 	.code64
 	.balign 4
@@ -169,6 +189,20 @@ SYM_CODE_START(startup_64)
 	jmpq	*tr_start(%rip)
 SYM_CODE_END(startup_64)
 
+SYM_CODE_START(trampoline_start64)
+	/*
+	 * APs start here on a direct transfer from 64-bit BIOS with identity
+	 * mapped page tables.  Load the kernel's GDT in order to gear down to
+	 * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+	 * segment registers.  Load the zero IDT so any fault triggers a
+	 * shutdown instead of jumping back into BIOS.
+	 */
+	lidt	tr_idt(%rip)
+	lgdt	tr_gdt64(%rip)
+
+	ljmpl	*tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
 	.section ".rodata","a"
 	# Duplicate the global descriptor table
 	# so the kernel can live anywhere
@@ -182,6 +216,17 @@ SYM_DATA_START(tr_gdt)
 	.quad	0x00cf93000000ffff	# __KERNEL_DS
 SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
 
+SYM_DATA_START(tr_gdt64)
+	.short	tr_gdt_end - tr_gdt - 1	# gdt limit
+	.long	pa_tr_gdt
+	.long	0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+	.long	pa_trampoline_compat
+	.short	__KERNEL32_CS
+SYM_DATA_END(tr_compat)
+
 	.bss
 	.balign	PAGE_SIZE
 SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 5033e640f957..506d5897112a 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -1,4 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 	.section ".rodata","a"
 	.balign	16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+SYM_DATA_START_LOCAL(tr_idt)
+	.short	0
+	.quad	0
+SYM_DATA_END(tr_idt)
-- 
2.25.1



* [RFC v1 16/26] x86/boot: Avoid #VE during compressed boot for TDX platforms
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (15 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 15/26] x86/boot: Add a trampoline for APs booting in 64-bit mode Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 17/26] x86/boot: Avoid unnecessary #VE during boot process Kuppuswamy Sathyanarayanan
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Avoid operations which will inject #VE during compressed
boot, which is obviously fatal for TDX platforms.

Details:

 1. TDX module injects #VE if a TDX guest attempts to write
    EFER. So skip the WRMSR to set EFER.LME=1 if it's already
    set. TDX also forces EFER.LME=1, i.e. the branch will always
    be taken and thus the #VE avoided.

 2. TDX module also injects a #VE if the guest attempts to clear
    CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed
    boot. Setting CR0.NE should be a nop on all CPUs that
    support 64-bit mode.
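
Both changes reduce to the same rule: never write a value that
clears a bit the TDX module pins. A rough C rendering of the two
checks (illustrative names; EFER.LME is bit 8, CR0.NE is bit 5):

```c
#include <assert.h>
#include <stdint.h>

#define _EFER_LME	8
#define EFER_LME	(1ULL << _EFER_LME)

#define X86_CR0_PE	(1UL << 0)
#define X86_CR0_NE	(1UL << 5)
#define X86_CR0_PG	(1UL << 31)

/* Mirrors the btsl/jc change: only write EFER when LME is not
 * already set, so a TDX guest (where LME is forced on) never
 * executes the WRMSR and never takes the #VE.
 */
static int efer_write_needed(uint64_t efer)
{
	return !(efer & EFER_LME);
}

/* Mirrors the CR0 change: NE is included unconditionally, so the
 * guest never attempts to clear a bit the TDX module keeps set.
 */
static unsigned long trampoline_cr0(void)
{
	return X86_CR0_PG | X86_CR0_NE | X86_CR0_PE;
}
```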

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S | 5 +++--
 arch/x86/boot/compressed/pgtable.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..37c2f37d4a0d 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,8 +616,9 @@ SYM_CODE_START(trampoline_32bit_src)
 	movl	$MSR_EFER, %ecx
 	rdmsr
 	btsl	$_EFER_LME, %eax
+	jc	1f
 	wrmsr
-	popl	%edx
+1:	popl	%edx
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
@@ -636,7 +637,7 @@ SYM_CODE_START(trampoline_32bit_src)
 	pushl	%eax
 
 	/* Enable paging again */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
+	movl	$(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	lret
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE	0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
 
 #define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
 
-- 
2.25.1



* [RFC v1 17/26] x86/boot: Avoid unnecessary #VE during boot process
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (16 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 16/26] x86/boot: Avoid #VE during compressed boot for TDX platforms Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 18/26] x86/topology: Disable CPU hotplug support for TDX platforms Kuppuswamy Sathyanarayanan
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Skip writing EFER during secondary_startup_64() if the current value
already matches the desired value. This avoids a #VE when running as a
TDX guest, as the TDX-Module does not allow writes to EFER (even when
writing the current, fixed value).

Also, preserve CR4.MCE instead of clearing it during boot to avoid a #VE
when running as a TDX guest. The TDX-Module (effectively part of the
hypervisor) requires CR4.MCE to be set at all times and injects a #VE
if the guest attempts to clear CR4.MCE.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S |  5 ++++-
 arch/x86/kernel/head_64.S          | 13 +++++++++++--
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 37c2f37d4a0d..2d79e5f97360 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -622,7 +622,10 @@ SYM_CODE_START(trampoline_32bit_src)
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
-	movl	$X86_CR4_PAE, %eax
+	movl	%cr4, %eax
+	/* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
+	andl	$X86_CR4_MCE, %eax
+	orl	$X86_CR4_PAE, %eax
 	testl	%edx, %edx
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..92c77cf75542 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 1:
 
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	movq	%cr4, %rcx
+	/* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
+	andl	$X86_CR4_MCE, %ecx
+	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
@@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	/* Setup EFER (Extended Feature Enable Register) */
 	movl	$MSR_EFER, %ecx
 	rdmsr
+	movl    %eax, %edx
 	btsl	$_EFER_SCE, %eax	/* Enable System Call */
 	btl	$20,%edi		/* No Execute supported? */
 	jnc     1f
 	btsl	$_EFER_NX, %eax
 	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1:	wrmsr				/* Make changes effective */
 
+	/* Skip the WRMSR if the current value matches the desired value. */
+1:	cmpl	%edx, %eax
+	je	1f
+	xor	%edx, %edx
+	wrmsr				/* Make changes effective */
+1:
 	/* Setup cr0 */
 	movl	$CR0_STATE, %eax
 	/* Make changes effective */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [RFC v1 18/26] x86/topology: Disable CPU hotplug support for TDX platforms.
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (17 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 17/26] x86/boot: Avoid unnecessary #VE during boot process Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 19/26] x86/tdx: Forcefully disable legacy PIC for TDX guests Kuppuswamy Sathyanarayanan
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

As per the Intel TDX Virtual Firmware Design Guide, sec 4.3.5 and
sec 9.4, all unused CPUs are kept in a spinning state by TDVF until
the OS requests CPU bring-up via the mailbox address passed in the
ACPI MADT table. Since by default all unused CPUs are always in a
spinning state, there is no point in supporting the dynamic CPU
online/offline feature. So the current generation of TDVF does not
support CPU hotplug; it may be supported in a future generation.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/tdx.c      | 14 ++++++++++++++
 arch/x86/kernel/topology.c |  3 ++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 8d1d7555fb56..a36b6ae14942 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -387,6 +387,17 @@ static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 	return insn.length;
 }
 
+static int tdx_cpu_offline_prepare(unsigned int cpu)
+{
+	/*
+	 * Per Intel TDX Virtual Firmware Design Guide,
+	 * sec 4.3.5 and sec 9.4, hotplug is not supported
+	 * on TDX platforms. So do not allow a CPU
+	 * to go offline once it has been onlined.
+	 */
+	return -EOPNOTSUPP;
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
@@ -399,6 +410,9 @@ void __init tdx_early_init(void)
 	pv_ops.irq.safe_halt = tdx_safe_halt;
 	pv_ops.irq.halt = tdx_halt;
 
+	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",
+			  NULL, tdx_cpu_offline_prepare);
+
 	pr_info("TDX guest is initialized\n");
 }
 
diff --git a/arch/x86/kernel/topology.c b/arch/x86/kernel/topology.c
index f5477eab5692..d879ea96d79c 100644
--- a/arch/x86/kernel/topology.c
+++ b/arch/x86/kernel/topology.c
@@ -34,6 +34,7 @@
 #include <linux/irq.h>
 #include <asm/io_apic.h>
 #include <asm/cpu.h>
+#include <asm/tdx.h>
 
 static DEFINE_PER_CPU(struct x86_cpu, cpu_devices);
 
@@ -130,7 +131,7 @@ int arch_register_cpu(int num)
 			}
 		}
 	}
-	if (num || cpu0_hotpluggable)
+	if ((num || cpu0_hotpluggable) && !is_tdx_guest())
 		per_cpu(cpu_devices, num).cpu.hotpluggable = 1;
 
 	return register_cpu(&per_cpu(cpu_devices, num).cpu, num);
-- 
2.25.1



* [RFC v1 19/26] x86/tdx: Forcefully disable legacy PIC for TDX guests
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (18 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 18/26] x86/topology: Disable CPU hotplug support for TDX platforms Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 20/26] x86/tdx: Introduce INTEL_TDX_GUEST config option Kuppuswamy Sathyanarayanan
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Disable the legacy PIC (8259) for TDX guests as the PIC cannot be
supported by the VMM. The TDX Module does not allow direct IRQ injection,
and using posted-interrupt style delivery requires the guest to EOI the
IRQ, which diverges from legacy PIC behavior.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index a36b6ae14942..ae37498df981 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -4,6 +4,7 @@
 #include <asm/tdx.h>
 #include <asm/cpufeature.h>
 #include <linux/cpu.h>
+#include <asm/i8259.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
 #include <asm/insn.h>
@@ -410,6 +411,8 @@ void __init tdx_early_init(void)
 	pv_ops.irq.safe_halt = tdx_safe_halt;
 	pv_ops.irq.halt = tdx_halt;
 
+	legacy_pic = &null_legacy_pic;
+
 	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",
 			  NULL, tdx_cpu_offline_prepare);
 
-- 
2.25.1



* [RFC v1 20/26] x86/tdx: Introduce INTEL_TDX_GUEST config option
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (19 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 19/26] x86/tdx: Forcefully disable legacy PIC for TDX guests Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code Kuppuswamy Sathyanarayanan
                   ` (10 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Add the INTEL_TDX_GUEST config option to selectively compile in
TDX guest support.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/Kconfig | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8fe91114bfee..0374d9f262a5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -868,6 +868,21 @@ config ACRN_GUEST
 	  IOT with small footprint and real-time features. More details can be
 	  found in https://projectacrn.org/.
 
+config INTEL_TDX_GUEST
+	bool "Intel Trusted Domain eXtensions Guest Support"
+	depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
+	depends on SECURITY
+	select PARAVIRT_XL
+	select X86_X2APIC
+	select SECURITY_LOCKDOWN_LSM
+	help
+	  Provide support for running in a trusted domain on Intel processors
+	  equipped with Trusted Domain eXtensions. TDX is a new Intel
+	  technology that extends VMX and Memory Encryption with a new kind of
+	  virtual machine guest called a Trust Domain (TD). A TD is designed to
+	  run in a CPU mode that protects the confidentiality of TD memory
+	  contents and the TD's CPU state from other software, including the VMM.
+
 endif #HYPERVISOR_GUEST
 
 source "arch/x86/Kconfig.cpu"
-- 
2.25.1



* [RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (20 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 20/26] x86/tdx: Introduce INTEL_TDX_GUEST config option Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-04-01 20:06   ` Dave Hansen
  2021-02-05 23:38 ` [RFC v1 22/26] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Kuppuswamy Sathyanarayanan
                   ` (9 subsequent siblings)
  31 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Intel TDX doesn't allow the VMM to access guest memory. Any memory that is
required for communication with the VMM must be shared explicitly by
setting a bit in the page table entry. The shared memory is similar to
unencrypted memory in AMD SME/SEV terminology.

force_dma_unencrypted() has to return true for a TDX guest. Move it out
of the AMD SME code.

Introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
selected by all x86 memory encryption features.

This is preparation for TDX changes in DMA code.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/Kconfig                 |  8 +++++--
 arch/x86/include/asm/io.h        |  4 +++-
 arch/x86/mm/Makefile             |  2 ++
 arch/x86/mm/mem_encrypt.c        | 30 -------------------------
 arch/x86/mm/mem_encrypt_common.c | 38 ++++++++++++++++++++++++++++++++
 5 files changed, 49 insertions(+), 33 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0374d9f262a5..8fa654d61ac2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1538,14 +1538,18 @@ config X86_CPA_STATISTICS
 	  helps to determine the effectiveness of preserving large and huge
 	  page mappings when mapping protections are changed.
 
+config X86_MEM_ENCRYPT_COMMON
+	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select DYNAMIC_PHYSICAL_MASK
+	def_bool n
+
 config AMD_MEM_ENCRYPT
 	bool "AMD Secure Memory Encryption (SME) support"
 	depends on X86_64 && CPU_SUP_AMD
 	select DMA_COHERENT_POOL
-	select DYNAMIC_PHYSICAL_MASK
 	select ARCH_USE_MEMREMAP_PROT
-	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
 	select INSTRUCTION_DECODER
+	select X86_MEM_ENCRYPT_COMMON
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 30a3b30395ad..95e534cffa99 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -257,10 +257,12 @@ static inline void slow_down_io(void)
 
 #endif
 
-#ifdef CONFIG_AMD_MEM_ENCRYPT
 #include <linux/jump_label.h>
 
 extern struct static_key_false sev_enable_key;
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
 static inline bool sev_key_active(void)
 {
 	return static_branch_unlikely(&sev_enable_key);
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 
+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON)	+= mem_encrypt_common.o
+
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index c79e5736ab2b..11a6a7b3af7e 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -15,10 +15,6 @@
 #include <linux/dma-direct.h>
 #include <linux/swiotlb.h>
 #include <linux/mem_encrypt.h>
-#include <linux/device.h>
-#include <linux/kernel.h>
-#include <linux/bitops.h>
-#include <linux/dma-mapping.h>
 
 #include <asm/tlbflush.h>
 #include <asm/fixmap.h>
@@ -389,32 +385,6 @@ bool noinstr sev_es_active(void)
 	return sev_status & MSR_AMD64_SEV_ES_ENABLED;
 }
 
-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
-{
-	/*
-	 * For SEV, all DMA must be to unencrypted addresses.
-	 */
-	if (sev_active())
-		return true;
-
-	/*
-	 * For SME, all DMA must be to unencrypted addresses if the
-	 * device does not support DMA to addresses that include the
-	 * encryption mask.
-	 */
-	if (sme_active()) {
-		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
-		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
-						dev->bus_dma_limit);
-
-		if (dma_dev_mask <= dma_enc_mask)
-			return true;
-	}
-
-	return false;
-}
-
 void __init mem_encrypt_free_decrypted_mem(void)
 {
 	unsigned long vaddr, vaddr_end, npages;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..964e04152417
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2016 Advanced Micro Devices, Inc.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/mem_encrypt.h>
+#include <linux/dma-mapping.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+	/*
+	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 */
+	if (sev_active())
+		return true;
+
+	/*
+	 * For SME, all DMA must be to unencrypted addresses if the
+	 * device does not support DMA to addresses that include the
+	 * encryption mask.
+	 */
+	if (sme_active()) {
+		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
+		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
+						dev->bus_dma_limit);
+
+		if (dma_dev_mask <= dma_enc_mask)
+			return true;
+	}
+
+	return false;
+}
-- 
2.25.1



* [RFC v1 22/26] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (21 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-04-01 20:13   ` Dave Hansen
  2021-02-05 23:38 ` [RFC v1 23/26] x86/tdx: Make pages shared in ioremap() Kuppuswamy Sathyanarayanan
                   ` (8 subsequent siblings)
  31 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

tdx_shared_mask() returns the mask that has to be set in a page table
entry to make the page shared with the VMM.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/Kconfig           | 1 +
 arch/x86/include/asm/tdx.h | 1 +
 arch/x86/kernel/tdx.c      | 8 ++++++++
 3 files changed, 10 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8fa654d61ac2..f10a00c4ad7f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -875,6 +875,7 @@ config INTEL_TDX_GUEST
 	select PARAVIRT_XL
 	select X86_X2APIC
 	select SECURITY_LOCKDOWN_LSM
+	select X86_MEM_ENCRYPT_COMMON
 	help
 	  Provide support for running in a trusted domain on Intel processors
 	  equipped with Trusted Domain eXtensions. TDX is a new Intel
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b46ae140e39b..9bbfe6520ea4 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -104,5 +104,6 @@ long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
 long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
 		unsigned long p3, unsigned long p4);
 
+phys_addr_t tdx_shared_mask(void);
 #endif
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ae37498df981..9681f4a0b4e0 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -41,6 +41,11 @@ bool is_tdx_guest(void)
 }
 EXPORT_SYMBOL_GPL(is_tdx_guest);
 
+phys_addr_t tdx_shared_mask(void)
+{
+	return 1ULL << (td_info.gpa_width - 1);
+}
+
 static void tdx_get_info(void)
 {
 	register long rcx asm("rcx");
@@ -56,6 +61,9 @@ static void tdx_get_info(void)
 
 	td_info.gpa_width = rcx & GENMASK(5, 0);
 	td_info.attributes = rdx;
+
+	/* Exclude Shared bit from the __PHYSICAL_MASK */
+	physical_mask &= ~tdx_shared_mask();
 }
 
 static __cpuidle void tdx_halt(void)
-- 
2.25.1



* [RFC v1 23/26] x86/tdx: Make pages shared in ioremap()
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (22 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 22/26] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-04-01 20:26   ` Dave Hansen
  2021-02-05 23:38 ` [RFC v1 24/26] x86/tdx: Add helper to do MapGPA TDVMCALL Kuppuswamy Sathyanarayanan
                   ` (7 subsequent siblings)
  31 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

All ioremap()ed pages that are not backed by normal memory (NONE or
RESERVED) have to be mapped as shared.

Reuse the infrastructure we have for AMD SEV.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h | 3 +++
 arch/x86/mm/ioremap.c          | 8 +++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..a82bab48379e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -21,6 +21,9 @@
 #define pgprot_encrypted(prot)	__pgprot(__sme_set(pgprot_val(prot)))
 #define pgprot_decrypted(prot)	__pgprot(__sme_clr(pgprot_val(prot)))
 
+/* Make the page accessible to the VMM */
+#define pgprot_tdx_shared(prot) __pgprot(pgprot_val(prot) | tdx_shared_mask())
+
 #ifndef __ASSEMBLY__
 #include <asm/x86_init.h>
 #include <asm/fpu/xstate.h>
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 9e5ccc56f8e0..a0ba760866d4 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -87,12 +87,12 @@ static unsigned int __ioremap_check_ram(struct resource *res)
 }
 
 /*
- * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
- * there the whole memory is already encrypted.
+ * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
+ * private in the TDX case) because the whole memory there is already encrypted.
  */
 static unsigned int __ioremap_check_encrypted(struct resource *res)
 {
-	if (!sev_active())
+	if (!sev_active() && !is_tdx_guest())
 		return 0;
 
 	switch (res->desc) {
@@ -244,6 +244,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
 	prot = PAGE_KERNEL_IO;
 	if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
 		prot = pgprot_encrypted(prot);
+	else if (is_tdx_guest())
+		prot = pgprot_tdx_shared(prot);
 
 	switch (pcm) {
 	case _PAGE_CACHE_MODE_UC:
-- 
2.25.1



* [RFC v1 24/26] x86/tdx: Add helper to do MapGPA TDVMCALL
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (23 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 23/26] x86/tdx: Make pages shared in ioremap() Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-02-05 23:38 ` [RFC v1 25/26] x86/tdx: Make DMA pages shared Kuppuswamy Sathyanarayanan
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The MapGPA TDVMCALL requests the host VMM to map a GPA range as private
or shared memory. Shared GPA mappings can be used for communication
between the TD guest and the host VMM, for example for paravirtualized
IO.

The new helper tdx_map_gpa() provides access to the operation.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/tdx.h |  2 ++
 arch/x86/kernel/tdx.c      | 28 ++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 9bbfe6520ea4..efffdef35c78 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -105,5 +105,7 @@ long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
 		unsigned long p3, unsigned long p4);
 
 phys_addr_t tdx_shared_mask(void);
+
+int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private);
 #endif
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 9681f4a0b4e0..f99fe54b4f88 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -14,6 +14,8 @@
 #include "tdx-kvm.c"
 #endif
 
+#define TDVMCALL_MAP_GPA	0x10001
+
 static struct {
 	unsigned int gpa_width;
 	unsigned long attributes;
@@ -66,6 +68,32 @@ static void tdx_get_info(void)
 	physical_mask &= ~tdx_shared_mask();
 }
 
+int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private)
+{
+	register long r10 asm("r10") = TDVMCALL_STANDARD;
+	register long r11 asm("r11") = TDVMCALL_MAP_GPA;
+	register long r12 asm("r12") = gpa;
+	register long r13 asm("r13") = PAGE_SIZE * numpages;
+	register long rcx asm("rcx");
+	long ret;
+
+	if (!private)
+		r12 |= tdx_shared_mask();
+
+	/* Allow to pass R10, R11, R12 and R13 down to the VMM */
+	rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13);
+
+	asm volatile(TDCALL
+			: "=a"(ret), "=r"(r10)
+			: "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
+			  "r"(r13)
+			: );
+
+	// Host kernel doesn't implement it yet.
+	// WARN_ON(ret || r10);
+	return ret || r10 ? -EIO : 0;
+}
+
 static __cpuidle void tdx_halt(void)
 {
 	register long r10 asm("r10") = TDVMCALL_STANDARD;
-- 
2.25.1



* [RFC v1 25/26] x86/tdx: Make DMA pages shared
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (24 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 24/26] x86/tdx: Add helper to do MapGPA TDVMCALL Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-04-01 21:01   ` Dave Hansen
  2021-02-05 23:38 ` [RFC v1 26/26] x86/kvm: Use bounce buffers for TD guest Kuppuswamy Sathyanarayanan
                   ` (5 subsequent siblings)
  31 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kai Huang, Sean Christopherson, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Make force_dma_unencrypted() return true for TDX to get DMA pages mapped
as shared.

__set_memory_enc_dec() is now aware of TDX and sets the Shared bit
accordingly, followed by the relevant TDVMCALL.

Also, do TDACCEPTPAGE on every 4k page after mapping the GPA range when
converting memory to private.  If the VMM uses a common pool for private
and shared memory, it will likely do TDAUGPAGE in response to MAP_GPA
(or on the first access to the private GPA), in which case TDX-Module will
hold the page in a non-present "pending" state until it is explicitly
accepted by the guest.

BUG() if TDACCEPTPAGE fails (except in the already-accepted case above),
as the guest is completely hosed if it can't access memory.

Tested-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/tdx.h       |  3 +++
 arch/x86/kernel/tdx.c            | 29 ++++++++++++++++++++++++++---
 arch/x86/mm/mem_encrypt_common.c |  4 ++--
 arch/x86/mm/pat/set_memory.c     | 23 ++++++++++++++++++-----
 4 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index efffdef35c78..9b66c3a5cf83 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,6 +8,9 @@
 #define TDVMCALL	0
 #define TDINFO		1
 #define TDGETVEINFO	3
+#define TDACCEPTPAGE	6
+
+#define TDX_PAGE_ALREADY_ACCEPTED       0x00000B0A00000000
 
 /* TDVMCALL R10 Input */
 #define TDVMCALL_STANDARD	0
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index f99fe54b4f88..f51a19168adc 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -68,7 +68,7 @@ static void tdx_get_info(void)
 	physical_mask &= ~tdx_shared_mask();
 }
 
-int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private)
+static int __tdx_map_gpa(phys_addr_t gpa, int numpages, bool private)
 {
 	register long r10 asm("r10") = TDVMCALL_STANDARD;
 	register long r11 asm("r11") = TDVMCALL_MAP_GPA;
@@ -89,11 +89,34 @@ int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private)
 			  "r"(r13)
 			: );
 
-	// Host kernel doesn't implement it yet.
-	// WARN_ON(ret || r10);
+	WARN_ON(ret || r10);
 	return ret || r10 ? -EIO : 0;
 }
 
+static void tdx_accept_page(phys_addr_t gpa)
+{
+	u64 ret;
+
+	asm volatile(TDCALL : "=a"(ret) : "a"(TDACCEPTPAGE), "c"(gpa));
+
+	BUG_ON(ret && (ret & ~0xffull) != TDX_PAGE_ALREADY_ACCEPTED);
+}
+
+
+int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private)
+{
+	int ret, i;
+
+	ret = __tdx_map_gpa(gpa, numpages, private);
+	if (ret || !private)
+		return ret;
+
+	for (i = 0; i < numpages; i++)
+		tdx_accept_page(gpa + i*PAGE_SIZE);
+
+	return 0;
+}
+
 static __cpuidle void tdx_halt(void)
 {
 	register long r10 asm("r10") = TDVMCALL_STANDARD;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 964e04152417..b6d93b0c5dcf 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -15,9 +15,9 @@
 bool force_dma_unencrypted(struct device *dev)
 {
 	/*
-	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 * For SEV and TDX, all DMA must be to unencrypted/shared addresses.
 	 */
-	if (sev_active())
+	if (sev_active() || is_tdx_guest())
 		return true;
 
 	/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..6f23a9816ef0 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -27,6 +27,7 @@
 #include <asm/proto.h>
 #include <asm/memtype.h>
 #include <asm/set_memory.h>
+#include <asm/tdx.h>
 
 #include "../mm_internal.h"
 
@@ -1977,8 +1978,8 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	struct cpa_data cpa;
 	int ret;
 
-	/* Nothing to do if memory encryption is not active */
-	if (!mem_encrypt_active())
+	/* Nothing to do if memory encryption and TDX are not active */
+	if (!mem_encrypt_active() && !is_tdx_guest())
 		return 0;
 
 	/* Should not be working on unaligned addresses */
@@ -1988,8 +1989,14 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	memset(&cpa, 0, sizeof(cpa));
 	cpa.vaddr = &addr;
 	cpa.numpages = numpages;
-	cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
-	cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+	if (is_tdx_guest()) {
+		cpa.mask_set = __pgprot(enc ? 0 : tdx_shared_mask());
+		cpa.mask_clr = __pgprot(enc ? tdx_shared_mask() : 0);
+	} else {
+		cpa.mask_set = __pgprot(enc ? _PAGE_ENC : 0);
+		cpa.mask_clr = __pgprot(enc ? 0 : _PAGE_ENC);
+	}
+
 	cpa.pgd = init_mm.pgd;
 
 	/* Must avoid aliasing mappings in the highmem code */
@@ -1999,7 +2006,8 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	/*
 	 * Before changing the encryption attribute, we need to flush caches.
 	 */
-	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	if (!enc || !is_tdx_guest())
+		cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
 
 	ret = __change_page_attr_set_clr(&cpa, 1);
 
@@ -2012,6 +2020,11 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	 */
 	cpa_flush(&cpa, 0);
 
+	if (!ret && is_tdx_guest()) {
+		ret = tdx_map_gpa(__pa(addr), numpages, enc);
+		// XXX: need to undo on error?
+	}
+
 	return ret;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [RFC v1 26/26] x86/kvm: Use bounce buffers for TD guest
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (25 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 25/26] x86/tdx: Make DMA pages shared Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:38 ` Kuppuswamy Sathyanarayanan
  2021-04-01 21:17   ` Dave Hansen
  2021-02-06  3:04 ` Test Email sathyanarayanan.kuppuswamy
                   ` (4 subsequent siblings)
  31 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-02-05 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

TDX does not allow DMA access to guest private memory. In order for
DMA to work properly in a TD guest, use SWIOTLB bounce buffers.

Move the AMD SEV initialization into common code and adapt it for TDX.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/pci-swiotlb.c    |  2 +-
 arch/x86/kernel/tdx.c            |  3 +++
 arch/x86/mm/mem_encrypt.c        | 44 -------------------------------
 arch/x86/mm/mem_encrypt_common.c | 45 ++++++++++++++++++++++++++++++++
 4 files changed, 49 insertions(+), 45 deletions(-)

diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..020e13749758 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -49,7 +49,7 @@ int __init pci_swiotlb_detect_4gb(void)
 	 * buffers are allocated and used for devices that do not support
 	 * the addressing range required for the encryption mask.
 	 */
-	if (sme_active())
+	if (sme_active() || is_tdx_guest())
 		swiotlb = 1;
 
 	return swiotlb;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index f51a19168adc..ccb9401bd706 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -9,6 +9,7 @@
 #include <asm/vmx.h>
 #include <asm/insn.h>
 #include <linux/sched/signal.h> /* force_sig_fault() */
+#include <linux/swiotlb.h>
 
 #ifdef CONFIG_KVM_GUEST
 #include "tdx-kvm.c"
@@ -472,6 +473,8 @@ void __init tdx_early_init(void)
 
 	legacy_pic = &null_legacy_pic;
 
+	swiotlb_force = SWIOTLB_FORCE;
+
 	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",
 			  NULL, tdx_cpu_offline_prepare);
 
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 11a6a7b3af7e..7fbbb2f3d426 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -408,47 +408,3 @@ void __init mem_encrypt_free_decrypted_mem(void)
 
 	free_init_pages("unused decrypted", vaddr, vaddr_end);
 }
-
-static void print_mem_encrypt_feature_info(void)
-{
-	pr_info("AMD Memory Encryption Features active:");
-
-	/* Secure Memory Encryption */
-	if (sme_active()) {
-		/*
-		 * SME is mutually exclusive with any of the SEV
-		 * features below.
-		 */
-		pr_cont(" SME\n");
-		return;
-	}
-
-	/* Secure Encrypted Virtualization */
-	if (sev_active())
-		pr_cont(" SEV");
-
-	/* Encrypted Register State */
-	if (sev_es_active())
-		pr_cont(" SEV-ES");
-
-	pr_cont("\n");
-}
-
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
-{
-	if (!sme_me_mask)
-		return;
-
-	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-	swiotlb_update_mem_attributes();
-
-	/*
-	 * With SEV, we need to unroll the rep string I/O instructions.
-	 */
-	if (sev_active())
-		static_branch_enable(&sev_enable_key);
-
-	print_mem_encrypt_feature_info();
-}
-
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index b6d93b0c5dcf..6f3d90d4d68e 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,6 +10,7 @@
 #include <linux/mm.h>
 #include <linux/mem_encrypt.h>
 #include <linux/dma-mapping.h>
+#include <linux/swiotlb.h>
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
@@ -36,3 +37,47 @@ bool force_dma_unencrypted(struct device *dev)
 
 	return false;
 }
+
+static void print_mem_encrypt_feature_info(void)
+{
+	pr_info("AMD Memory Encryption Features active:");
+
+	/* Secure Memory Encryption */
+	if (sme_active()) {
+		/*
+		 * SME is mutually exclusive with any of the SEV
+		 * features below.
+		 */
+		pr_cont(" SME\n");
+		return;
+	}
+
+	/* Secure Encrypted Virtualization */
+	if (sev_active())
+		pr_cont(" SEV");
+
+	/* Encrypted Register State */
+	if (sev_es_active())
+		pr_cont(" SEV-ES");
+
+	pr_cont("\n");
+}
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+	if (!sme_me_mask && !is_tdx_guest())
+		return;
+
+	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
+	swiotlb_update_mem_attributes();
+
+	/*
+	 * With SEV, we need to unroll the rep string I/O instructions.
+	 */
+	if (sev_active())
+		static_branch_enable(&sev_enable_key);
+
+	if (!is_tdx_guest())
+		print_mem_encrypt_feature_info();
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* Re: [RFC v1 09/26] x86/tdx: Handle CPUID via #VE
  2021-02-05 23:38 ` [RFC v1 09/26] x86/tdx: Handle CPUID via #VE Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:42   ` Andy Lutomirski
  2021-02-07 14:13     ` Kirill A. Shutemov
  0 siblings, 1 reply; 161+ messages in thread
From: Andy Lutomirski @ 2021-02-05 23:42 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, LKML

On Fri, Feb 5, 2021 at 3:39 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> TDX has three classes of CPUID leaves: some CPUID leaves
> are always handled by the CPU, others are handled by the TDX module,
> and some others are handled by the VMM. Since the VMM cannot directly
> intercept the instruction these are reflected with a #VE exception
> to the guest, which then converts it into a TDCALL to the VMM,
> or handled directly.
>
> The TDX module EAS has a full list of CPUID leaves which are handled
> natively or by the TDX module in 16.2. Only unknown CPUIDs are handled by
> the #VE method. In practice this typically only applies to the
> hypervisor specific CPUIDs unknown to the native CPU.
>
> Therefore there is no risk of causing this in early CPUID code which
> runs before the #VE handler is set up because it will never access
> those exotic CPUID leaves.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/kernel/tdx.c | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
>
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 5d961263601e..e98058c048b5 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -172,6 +172,35 @@ static int tdx_write_msr_safe(unsigned int msr, unsigned int low,
>         return ret || r10 ? -EIO : 0;
>  }
>
> +static void tdx_handle_cpuid(struct pt_regs *regs)
> +{
> +       register long r10 asm("r10") = TDVMCALL_STANDARD;
> +       register long r11 asm("r11") = EXIT_REASON_CPUID;
> +       register long r12 asm("r12") = regs->ax;
> +       register long r13 asm("r13") = regs->cx;
> +       register long r14 asm("r14");
> +       register long r15 asm("r15");
> +       register long rcx asm("rcx");
> +       long ret;
> +
> +       /* Allow to pass R10, R11, R12, R13, R14 and R15 down to the VMM */
> +       rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
> +
> +       asm volatile(TDCALL
> +                       : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13),
> +                         "=r"(r14), "=r"(r15)
> +                       : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
> +                         "r"(r13)
> +                       : );

Some "+" constraints would make this simpler.  But I think you should
factor the TDCALL helper out into its own function.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 13/26] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-02-05 23:38 ` [RFC v1 13/26] x86/tdx: Handle MWAIT, MONITOR and WBINVD Kuppuswamy Sathyanarayanan
@ 2021-02-05 23:43   ` Andy Lutomirski
  2021-02-05 23:54     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Andy Lutomirski @ 2021-02-05 23:43 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, LKML

On Fri, Feb 5, 2021 at 3:39 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> are not supported. So handle #VE due to these instructions as no ops.
>

MWAIT turning into NOP is no good.  How about suppressing
X86_FEATURE_MWAIT instead?

--Andy

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 13/26] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-02-05 23:43   ` Andy Lutomirski
@ 2021-02-05 23:54     ` Kuppuswamy, Sathyanarayanan
  2021-02-06  1:05       ` Andy Lutomirski
  0 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-02-05 23:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Dave Hansen, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, LKML

Hi Andy,

On 2/5/21 3:43 PM, Andy Lutomirski wrote:
> MWAIT turning into NOP is no good.  How about suppressing
> X86_FEATURE_MWAIT instead?
Yes, we can suppress it in tdx_early_init().

  + setup_clear_cpu_cap(X86_FEATURE_MWAIT);

But do you want to leave the MWAIT #VE handler as it is
(just in case)?


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 13/26] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-02-05 23:54     ` Kuppuswamy, Sathyanarayanan
@ 2021-02-06  1:05       ` Andy Lutomirski
  2021-03-27  0:18         ` [PATCH v1 1/1] " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Andy Lutomirski @ 2021-02-06  1:05 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Andy Lutomirski, Peter Zijlstra, Dave Hansen, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, LKML

On Fri, Feb 5, 2021 at 3:54 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> Hi Andy,
>
> On 2/5/21 3:43 PM, Andy Lutomirski wrote:
> > MWAIT turning into NOP is no good.  How about suppressing
> > X86_FEATURE_MWAIT instead?
> Yes, we can suppress it in tdx_early_init().
>
>   + setup_clear_cpu_cap(X86_FEATURE_MWAIT);
>
> But do you want to leave the MWAIT #VE handler as it is
> (just in case)?
>

I would suggest decoding the error, printing a useful message, and
oopsing or at least warning.

>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Test Email
@ 2021-02-06  3:02 sathyanarayanan.kuppuswamy
  2021-02-05 23:38 ` [RFC v1 00/26] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (31 more replies)
  0 siblings, 32 replies; 161+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2021-02-06  3:02 UTC (permalink / raw)
  To: linux-kernel

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Hi All,

Sending a test email to verify my mail server. please ignore it.

-- 
2.25.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Test Email
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (26 preceding siblings ...)
  2021-02-05 23:38 ` [RFC v1 26/26] x86/kvm: Use bounce buffers for TD guest Kuppuswamy Sathyanarayanan
@ 2021-02-06  3:04 ` sathyanarayanan.kuppuswamy
  2021-02-06  6:24 ` [RFC v1 00/26] Add TDX Guest Support sathyanarayanan.kuppuswamy
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2021-02-06  3:04 UTC (permalink / raw)
  To: linux-kernel

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Hi All,

Sending a test email to verify my mail server. please ignore it.

-- 
2.25.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC v1 00/26] Add TDX Guest Support
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (27 preceding siblings ...)
  2021-02-06  3:04 ` Test Email sathyanarayanan.kuppuswamy
@ 2021-02-06  6:24 ` sathyanarayanan.kuppuswamy
  2021-03-31 21:38 ` Kuppuswamy, Sathyanarayanan
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 161+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2021-02-06  6:24 UTC (permalink / raw)
  To: linux-kernel

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Hi All,

NOTE: This series is not ready for wide public review. It is being
specifically posted so that Peter Z and other experts on the entry
code can look for problems with the new exception handler (#VE).
That's also why x86@ is not being spammed.

Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. This series adds the bare-minimum
support to run a TDX guest. The host-side support will be submitted
separately, as will support for advanced TD guest features like
attestation and debug mode. At this point the series is not secure: it
has known holes in drivers, and it has not yet been fully audited and
fuzzed.

TDX has a lot of similarities to SEV. It enhances the confidentiality
of guest memory and state (such as registers) and includes a new exception
(#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
yet), TDX limits the host's ability to effect changes in the guest
physical address space.

In contrast to the SEV code in the kernel, TDX guest memory is integrity
protected and isolated; the host is prevented from accessing guest
memory (even ciphertext).

The TDX architecture also includes a new CPU mode called
Secure-Arbitration Mode (SEAM). The software (TDX module) running in this
mode arbitrates interactions between host and guest and implements many of
the guarantees of the TDX architecture.

Some of the key differences between a TD and a regular VM are:

1. Multi-CPU bring-up is done using the ACPI MADT wake-up table.
2. A new #VE exception handler is added. The TDX module injects a #VE
   exception into the guest TD for instructions that need to be emulated,
   disallowed MSR accesses, a subset of CPUID leaves, etc.
3. By default, memory is marked as private, and the TD selectively shares
   it with the VMM as needed.
4. Remote attestation is supported, enabling a third party (either the owner
   of the workload or a user of the services it provides) to establish that
   the workload is running on an Intel-TDX-enabled platform inside a TD
   before providing it with data.

You can find TDX-related documents at the following link:

https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html

This RFC series has been reviewed by Dave Hansen.

Kirill A. Shutemov (16):
  x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  x86/tdx: Get TD execution environment information via TDINFO
  x86/traps: Add #VE support for TDX guest
  x86/tdx: Add HLT support for TDX guest
  x86/tdx: Wire up KVM hypercalls
  x86/tdx: Add MSR support for TDX guest
  x86/tdx: Handle CPUID via #VE
  x86/io: Allow to override inX() and outX() implementation
  x86/tdx: Handle port I/O
  x86/tdx: Handle in-kernel MMIO
  x86/mm: Move force_dma_unencrypted() to common code
  x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  x86/tdx: Make pages shared in ioremap()
  x86/tdx: Add helper to do MapGPA TDVMCALL
  x86/tdx: Make DMA pages shared
  x86/kvm: Use bounce buffers for TD guest

Kuppuswamy Sathyanarayanan (6):
  x86/cpufeatures: Add TDX Guest CPU feature
  x86/cpufeatures: Add is_tdx_guest() interface
  x86/tdx: Handle MWAIT, MONITOR and WBINVD
  ACPI: tables: Add multiprocessor wake-up support
  x86/topology: Disable CPU hotplug support for TDX platforms.
  x86/tdx: Introduce INTEL_TDX_GUEST config option

Sean Christopherson (4):
  x86/boot: Add a trampoline for APs booting in 64-bit mode
  x86/boot: Avoid #VE during compressed boot for TDX platforms
  x86/boot: Avoid unnecessary #VE during boot process
  x86/tdx: Forcefully disable legacy PIC for TDX guests

 arch/x86/Kconfig                         |  28 +-
 arch/x86/boot/compressed/Makefile        |   2 +
 arch/x86/boot/compressed/head_64.S       |  10 +-
 arch/x86/boot/compressed/misc.h          |   1 +
 arch/x86/boot/compressed/pgtable.h       |   2 +-
 arch/x86/boot/compressed/tdx.c           |  32 ++
 arch/x86/boot/compressed/tdx_io.S        |   9 +
 arch/x86/include/asm/apic.h              |   3 +
 arch/x86/include/asm/asm-prototypes.h    |   1 +
 arch/x86/include/asm/cpufeatures.h       |   1 +
 arch/x86/include/asm/idtentry.h          |   4 +
 arch/x86/include/asm/io.h                |  25 +-
 arch/x86/include/asm/irqflags.h          |  42 +-
 arch/x86/include/asm/kvm_para.h          |  21 +
 arch/x86/include/asm/paravirt.h          |  22 +-
 arch/x86/include/asm/paravirt_types.h    |   3 +-
 arch/x86/include/asm/pgtable.h           |   3 +
 arch/x86/include/asm/realmode.h          |   1 +
 arch/x86/include/asm/tdx.h               | 114 +++++
 arch/x86/kernel/Makefile                 |   1 +
 arch/x86/kernel/acpi/boot.c              |  56 +++
 arch/x86/kernel/apic/probe_32.c          |   8 +
 arch/x86/kernel/apic/probe_64.c          |   8 +
 arch/x86/kernel/head64.c                 |   3 +
 arch/x86/kernel/head_64.S                |  13 +-
 arch/x86/kernel/idt.c                    |   6 +
 arch/x86/kernel/paravirt.c               |   4 +-
 arch/x86/kernel/pci-swiotlb.c            |   2 +-
 arch/x86/kernel/smpboot.c                |   5 +
 arch/x86/kernel/tdx-kvm.c                | 116 +++++
 arch/x86/kernel/tdx.c                    | 560 +++++++++++++++++++++++
 arch/x86/kernel/tdx_io.S                 | 143 ++++++
 arch/x86/kernel/topology.c               |   3 +-
 arch/x86/kernel/traps.c                  |  73 ++-
 arch/x86/mm/Makefile                     |   2 +
 arch/x86/mm/ioremap.c                    |   8 +-
 arch/x86/mm/mem_encrypt.c                |  74 ---
 arch/x86/mm/mem_encrypt_common.c         |  83 ++++
 arch/x86/mm/mem_encrypt_identity.c       |   1 +
 arch/x86/mm/pat/set_memory.c             |  23 +-
 arch/x86/realmode/rm/header.S            |   1 +
 arch/x86/realmode/rm/trampoline_64.S     |  49 +-
 arch/x86/realmode/rm/trampoline_common.S |   5 +-
 drivers/acpi/tables.c                    |   9 +
 include/acpi/actbl2.h                    |  21 +-
 45 files changed, 1444 insertions(+), 157 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdx.c
 create mode 100644 arch/x86/boot/compressed/tdx_io.S
 create mode 100644 arch/x86/include/asm/tdx.h
 create mode 100644 arch/x86/kernel/tdx-kvm.c
 create mode 100644 arch/x86/kernel/tdx.c
 create mode 100644 arch/x86/kernel/tdx_io.S
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 09/26] x86/tdx: Handle CPUID via #VE
  2021-02-05 23:42   ` Andy Lutomirski
@ 2021-02-07 14:13     ` Kirill A. Shutemov
  2021-02-07 16:01       ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kirill A. Shutemov @ 2021-02-07 14:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Dave Hansen,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, LKML

On Fri, Feb 05, 2021 at 03:42:01PM -0800, Andy Lutomirski wrote:
> On Fri, Feb 5, 2021 at 3:39 PM Kuppuswamy Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > TDX has three classes of CPUID leaves: some CPUID leaves
> > are always handled by the CPU, others are handled by the TDX module,
> > and some others are handled by the VMM. Since the VMM cannot directly
> > intercept the instruction these are reflected with a #VE exception
> > to the guest, which then converts it into a TDCALL to the VMM,
> > or handled directly.
> >
> > The TDX module EAS has a full list of CPUID leaves which are handled
> > natively or by the TDX module in 16.2. Only unknown CPUIDs are handled by
> > the #VE method. In practice this typically only applies to the
> > hypervisor specific CPUIDs unknown to the native CPU.
> >
> > Therefore there is no risk of causing this in early CPUID code which
> > runs before the #VE handler is set up because it will never access
> > those exotic CPUID leaves.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Reviewed-by: Andi Kleen <ak@linux.intel.com>
> > Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> > ---
> >  arch/x86/kernel/tdx.c | 32 ++++++++++++++++++++++++++++++++
> >  1 file changed, 32 insertions(+)
> >
> > diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> > index 5d961263601e..e98058c048b5 100644
> > --- a/arch/x86/kernel/tdx.c
> > +++ b/arch/x86/kernel/tdx.c
> > @@ -172,6 +172,35 @@ static int tdx_write_msr_safe(unsigned int msr, unsigned int low,
> >         return ret || r10 ? -EIO : 0;
> >  }
> >
> > +static void tdx_handle_cpuid(struct pt_regs *regs)
> > +{
> > +       register long r10 asm("r10") = TDVMCALL_STANDARD;
> > +       register long r11 asm("r11") = EXIT_REASON_CPUID;
> > +       register long r12 asm("r12") = regs->ax;
> > +       register long r13 asm("r13") = regs->cx;
> > +       register long r14 asm("r14");
> > +       register long r15 asm("r15");
> > +       register long rcx asm("rcx");
> > +       long ret;
> > +
> > +       /* Allow to pass R10, R11, R12, R13, R14 and R15 down to the VMM */
> > +       rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
> > +
> > +       asm volatile(TDCALL
> > +                       : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13),
> > +                         "=r"(r14), "=r"(r15)
> > +                       : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
> > +                         "r"(r13)
> > +                       : );
> 
> Some "+" constraints would make this simpler.  But I think you should
> factor the TDCALL helper out into its own function.

Factoring out TDCALL into a helper is tricky: different TDCALLs have
different lists of registers passed to the VMM.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 09/26] x86/tdx: Handle CPUID via #VE
  2021-02-07 14:13     ` Kirill A. Shutemov
@ 2021-02-07 16:01       ` Dave Hansen
  2021-02-07 20:29         ` Kirill A. Shutemov
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-02-07 16:01 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andy Lutomirski
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, LKML

On 2/7/21 6:13 AM, Kirill A. Shutemov wrote:
>>> +       /* Allow to pass R10, R11, R12, R13, R14 and R15 down to the VMM */
>>> +       rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
>>> +
>>> +       asm volatile(TDCALL
>>> +                       : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13),
>>> +                         "=r"(r14), "=r"(r15)
>>> +                       : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
>>> +                         "r"(r13)
>>> +                       : );
>> Some "+" constraints would make this simpler.  But I think you should
>> factor the TDCALL helper out into its own function.
> Factoring out TDCALL into a helper is tricky: different TDCALLs have
> different lists of registers passed to the VMM.

Couldn't you just have one big helper that takes *all* the registers
that get used in any TDVMCALL and sets all the rcx bits?  The users
could just pass 0's for the things they don't use.

Then you've got the ugly inline asm in one place.  It also makes it
harder to screw up the 'rcx' mask and end up passing registers you
didn't want into a malicious VMM.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 09/26] x86/tdx: Handle CPUID via #VE
  2021-02-07 16:01       ` Dave Hansen
@ 2021-02-07 20:29         ` Kirill A. Shutemov
  2021-02-07 22:31           ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kirill A. Shutemov @ 2021-02-07 20:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, LKML

On Sun, Feb 07, 2021 at 08:01:50AM -0800, Dave Hansen wrote:
> On 2/7/21 6:13 AM, Kirill A. Shutemov wrote:
> >>> +       /* Allow to pass R10, R11, R12, R13, R14 and R15 down to the VMM */
> >>> +       rcx = BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15);
> >>> +
> >>> +       asm volatile(TDCALL
> >>> +                       : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12), "=r"(r13),
> >>> +                         "=r"(r14), "=r"(r15)
> >>> +                       : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12),
> >>> +                         "r"(r13)
> >>> +                       : );
> >> Some "+" constraints would make this simpler.  But I think you should
> >> factor the TDCALL helper out into its own function.
> > Factoring out TDCALL into a helper is tricky: different TDCALLs have
> > different lists of registers passed to the VMM.
> 
> Couldn't you just have one big helper that takes *all* the registers
> that get used in any TDVMCALL and sets all the rcx bits?  The users
> could just pass 0's for the things they don't use.
> 
> Then you've got the ugly inline asm in one place.  It also makes it
> harder to screw up the 'rcx' mask and end up passing registers you
> didn't want into a malicious VMM.

For now we only pass down R10-R15, but the interface allows passing down
a much wider set of registers, including XMM. How far do we want to take it?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 09/26] x86/tdx: Handle CPUID via #VE
  2021-02-07 20:29         ` Kirill A. Shutemov
@ 2021-02-07 22:31           ` Dave Hansen
  2021-02-07 22:45             ` Andy Lutomirski
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-02-07 22:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, LKML

On 2/7/21 12:29 PM, Kirill A. Shutemov wrote:
>> Couldn't you just have one big helper that takes *all* the registers
>> that get used in any TDVMCALL and sets all the rcx bits?  The users
>> could just pass 0's for the things they don't use.
>>
>> Then you've got the ugly inline asm in one place.  It also makes it
>> harder to screw up the 'rcx' mask and end up passing registers you
>> didn't want into a malicious VMM.
> For now we only pass down R10-R15, but the interface allows passing down
> a much wider set of registers, including XMM. How far do we want to take it?

Just do what we immediately need: R10-R15.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 09/26] x86/tdx: Handle CPUID via #VE
  2021-02-07 22:31           ` Dave Hansen
@ 2021-02-07 22:45             ` Andy Lutomirski
  2021-02-08 17:10               ` Sean Christopherson
  2021-03-18 21:30               ` [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions Kuppuswamy Sathyanarayanan
  0 siblings, 2 replies; 161+ messages in thread
From: Andy Lutomirski @ 2021-02-07 22:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andy Lutomirski, Kuppuswamy Sathyanarayanan,
	Peter Zijlstra, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, LKML


> On Feb 7, 2021, at 2:31 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 2/7/21 12:29 PM, Kirill A. Shutemov wrote:
>>> Couldn't you just have one big helper that takes *all* the registers
>>> that get used in any TDVMCALL and sets all the rcx bits?  The users
>>> could just pass 0's for the things they don't use.
>>> 
>>> Then you've got the ugly inline asm in one place.  It also makes it
>>> harder to screw up the 'rcx' mask and end up passing registers you
>>> didn't want into a malicious VMM.
>> For now we only pass down R10-R15, but the interface allows to pass down
>> much wider set of registers, including XMM. How far do we want to get it?
> 
> Just do what we immediately need: R10-R15.
> 

How much of the register state is revealed to the VMM when we do a TDVMCALL?  Presumably we should fully sanitize all register state that shows up in cleartext on the other end, and we should treat all regs that can be modified by the VMM as clobbered.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 04/26] x86/tdx: Get TD execution environment information via TDINFO
  2021-02-05 23:38 ` [RFC v1 04/26] x86/tdx: Get TD execution environment information via TDINFO Kuppuswamy Sathyanarayanan
@ 2021-02-08 10:00   ` Peter Zijlstra
  2021-02-08 19:10     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2021-02-08 10:00 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Andy Lutomirski, Dave Hansen, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

On Fri, Feb 05, 2021 at 03:38:21PM -0800, Kuppuswamy Sathyanarayanan wrote:
> +/*
> + * TDCALL instruction is newly added in TDX architecture,
> + * used by TD for requesting the host VMM to provide
> + * (untrusted) services.
> + */
> +#define TDCALL	".byte 0x66,0x0f,0x01,0xcc"

This needs a binutils version number.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-05 23:38 ` [RFC v1 05/26] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
@ 2021-02-08 10:20   ` Peter Zijlstra
  2021-02-08 16:23     ` Andi Kleen
  2021-02-12 19:20   ` Dave Hansen
  2021-02-12 19:47   ` Andy Lutomirski
  2 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2021-02-08 10:20 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Andy Lutomirski, Dave Hansen, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson

On Fri, Feb 05, 2021 at 03:38:22PM -0800, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> The TDX module injects a #VE exception into the guest TD in cases of
> disallowed instructions, disallowed MSR accesses and a subset of CPUID
> leaves. Also, it's theoretically possible for the CPU to inject #VE
> on an EPT violation, but the TDX module makes sure this does not
> happen, as long as all memory used is properly accepted using TDCALLs.
> You can find more details in the Guest-Host-Communication Interface
> (GHCI) for Intel Trust Domain Extensions (Intel TDX) specification,
> sec 2.3.
> 
> Add basic infrastructure to handle #VE. If there is no handler for a
> given #VE, it is an unexpected event (fault case), so treat it as a
> general protection fault and handle it via do_general_protection().
> 
> TDCALL[TDGETVEINFO] provides information about #VE such as exit reason.
> 
> More details on cases where #VE exceptions are allowed/not-allowed:
> 
> The #VE exception does not occur in the paranoid entry paths, like NMIs.
> While other operations during an NMI might cause #VE, these are in the
> NMI code that can handle nesting, so there is no concern about
> reentrancy. This is similar to how #PF is handled in NMIs.
> 
> The #VE exception also cannot happen in entry/exit code with the
> wrong GS, such as the SWAPGS code, so its entry point does not
> need "paranoid" handling.

All of the above are arranged by using the below secure EPT for init
text and data?

> Any memory accesses can cause #VE if it causes an EPT
> violation.  However, the VMM is only in direct control of some of the
> EPT tables.  The Secure EPT tables are controlled by the TDX module
> which guarantees no EPT violations will result in #VE for the guest,
> once the memory has been accepted.

Which is supposedly then set up to avoid #VE during the syscall gap,
yes? Which then results in #VE not having to be IST.

> +#ifdef CONFIG_INTEL_TDX_GUEST
> +DEFINE_IDTENTRY(exc_virtualization_exception)
> +{
> +	struct ve_info ve;
> +	int ret;
> +
> +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> +
> +	/* Consume #VE info before re-enabling interrupts */

So what happens if NMI happens here, and triggers a nested #VE ?

> +	ret = tdx_get_ve_info(&ve);
> +	cond_local_irq_enable(regs);
> +	if (!ret)
> +		ret = tdx_handle_virtualization_exception(regs, &ve);
> +	/*
> +	 * If #VE exception handler could not handle it successfully, treat
> +	 * it as #GP(0) and handle it.
> +	 */
> +	if (ret)
> +		do_general_protection(regs, 0);
> +	cond_local_irq_disable(regs);
> +}
> +#endif


* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-08 10:20   ` Peter Zijlstra
@ 2021-02-08 16:23     ` Andi Kleen
  2021-02-08 16:33       ` Peter Zijlstra
  0 siblings, 1 reply; 161+ messages in thread
From: Andi Kleen @ 2021-02-08 16:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kuppuswamy Sathyanarayanan, Andy Lutomirski, Dave Hansen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson

> Which is supposedly then set up to avoid #VE during the syscall gap,
> yes? Which then results in #VE not having to be IST.

Yes that is currently true because all memory is pre-accepted.

If we ever do lazy accept we would need to make sure the memory accessed in
the syscall gap is already accepted, or move over to an IST.

> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > +{
> > +	struct ve_info ve;
> > +	int ret;
> > +
> > +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> > +
> > +	/* Consume #VE info before re-enabling interrupts */
> 
> So what happens if NMI happens here, and triggers a nested #VE ?

Yes that's a gap. We should probably bail out and reexecute the original
instruction. The VE handler would need to set a flag for that.

Or alternatively the NMI always gets the VE information and puts
it on some internal stack, but that would seem clunkier.


-Andi


* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-08 16:23     ` Andi Kleen
@ 2021-02-08 16:33       ` Peter Zijlstra
  2021-02-08 16:46         ` Sean Christopherson
  2021-02-08 16:46         ` Andi Kleen
  0 siblings, 2 replies; 161+ messages in thread
From: Peter Zijlstra @ 2021-02-08 16:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy Sathyanarayanan, Andy Lutomirski, Dave Hansen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson

On Mon, Feb 08, 2021 at 08:23:01AM -0800, Andi Kleen wrote:
> > Which is supposedly then set up to avoid #VE during the syscall gap,
> > yes? Which then results in #VE not having to be IST.
> 
> Yes that is currently true because all memory is pre-accepted.
> 
> If we ever do lazy accept we would need to make sure the memory accessed in
> the syscall gap is already accepted, or move over to an IST.

I think we're going to mandate the entry text/data will have to be
pre-accepted to avoid IST. ISTs really are crap.

> > > +#ifdef CONFIG_INTEL_TDX_GUEST
> > > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > > +{
> > > +	struct ve_info ve;
> > > +	int ret;
> > > +
> > > +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> > > +
> > > +	/* Consume #VE info before re-enabling interrupts */
> > 
> > So what happens if NMI happens here, and triggers a nested #VE ?
> 
> Yes that's a gap. We should probably bail out and reexecute the original
> instruction. The VE handler would need to set a flag for that.
> 
> Or alternatively the NMI always gets the VE information and puts
> it on some internal stack, but that would seem clunkier.

The same is possible with MCE and #DB I imagine.


* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-08 16:33       ` Peter Zijlstra
@ 2021-02-08 16:46         ` Sean Christopherson
  2021-02-08 16:59           ` Peter Zijlstra
  2021-02-08 16:46         ` Andi Kleen
  1 sibling, 1 reply; 161+ messages in thread
From: Sean Christopherson @ 2021-02-08 16:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, Andy Lutomirski,
	Dave Hansen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel, Sean Christopherson

On Mon, Feb 08, 2021, Peter Zijlstra wrote:
> On Mon, Feb 08, 2021 at 08:23:01AM -0800, Andi Kleen wrote:
> > > > +#ifdef CONFIG_INTEL_TDX_GUEST
> > > > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > > > +{
> > > > +	struct ve_info ve;
> > > > +	int ret;
> > > > +
> > > > +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> > > > +
> > > > +	/* Consume #VE info before re-enabling interrupts */
> > > 
> > > So what happens if NMI happens here, and triggers a nested #VE ?
> > 
> > Yes that's a gap. We should probably bail out and reexecute the original
> > instruction. The VE handler would need to set a flag for that.

No, NMI cannot happen here.  The TDX-Module "blocks" NMIs until the #VE info is
consumed by the guest.

> > Or alternatively the NMI always gets the VE information and puts
> > it on some internal stack, but that would seem clunkier.
> 
> The same is possible with MCE and #DB I imagine.

The MCE "architecture" for a TDX guest is rather stupid.  The guest is required
to keep CR4.MCE=1, but at least for TDX 1.0 the VMM is not allowed to inject #MC.
So, for better or worse, #MC is a non-issue.

#VE->#DB->#VE would be an issue, presumably this needs to be noinstr (or whatever
it is that prevents #DBs on functions).


* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-08 16:33       ` Peter Zijlstra
  2021-02-08 16:46         ` Sean Christopherson
@ 2021-02-08 16:46         ` Andi Kleen
  1 sibling, 0 replies; 161+ messages in thread
From: Andi Kleen @ 2021-02-08 16:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kuppuswamy Sathyanarayanan, Andy Lutomirski, Dave Hansen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson

> > > So what happens if NMI happens here, and triggers a nested #VE ?
> > 
> > Yes that's a gap. We should probably bail out and reexecute the original
> > instruction. The VE handler would need to set a flag for that.
> > 
> > Or alternatively the NMI always gets the VE information and puts
> > it on some internal stack, but that would seem clunkier.
> 
> The same is possible with MCE and #DB I imagine.

I don't think there are currently any plans to inject #MC into TDX guests. It's
doubtful this could be done securely.

#DB is trickier because it will happen every time, so simply re-executing
won't work. I guess it would need the #VE info stack, or some care in
kprobes/the kernel debugger to ensure it cannot happen. I think I would
prefer the latter.

-Andi



* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-08 16:46         ` Sean Christopherson
@ 2021-02-08 16:59           ` Peter Zijlstra
  2021-02-08 19:05             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2021-02-08 16:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, Andy Lutomirski,
	Dave Hansen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel, Sean Christopherson

On Mon, Feb 08, 2021 at 08:46:23AM -0800, Sean Christopherson wrote:
> On Mon, Feb 08, 2021, Peter Zijlstra wrote:
> > On Mon, Feb 08, 2021 at 08:23:01AM -0800, Andi Kleen wrote:
> > > > > +#ifdef CONFIG_INTEL_TDX_GUEST
> > > > > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > > > > +{
> > > > > +	struct ve_info ve;
> > > > > +	int ret;
> > > > > +
> > > > > +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> > > > > +
> > > > > +	/* Consume #VE info before re-enabling interrupts */
> > > > 
> > > > So what happens if NMI happens here, and triggers a nested #VE ?
> > > 
> > > Yes that's a gap. We should probably bail out and reexecute the original
> > > instruction. The VE handler would need to set a flag for that.
> 
> No, NMI cannot happen here.  The TDX-Module "blocks" NMIs until the #VE info is
> consumed by the guest.

'cute', might be useful to have that mentioned somewhere.

> > > Or alternatively the NMI always gets the VE information and puts
> > > it on some internal stack, but that would seem clunkier.
> > 
> > The same is possible with MCE and #DB I imagine.
> 
> The MCE "architecture" for a TDX guest is rather stupid.  The guest is required
> to keep CR4.MCE=1, but at least for TDX 1.0 the VMM is not allowed to inject #MC.
> So, for better or worse, #MC is a non-issue.
> 
> #VE->#DB->#VE would be an issue, presumably this needs to be noinstr (or whatever
> it is that prevents #DBs on functions).

Ah, it is that already of course, so yeah #DB can't happen here.


* Re: [RFC v1 09/26] x86/tdx: Handle CPUID via #VE
  2021-02-07 22:45             ` Andy Lutomirski
@ 2021-02-08 17:10               ` Sean Christopherson
  2021-02-08 17:35                 ` Andy Lutomirski
  2021-03-18 21:30               ` [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions Kuppuswamy Sathyanarayanan
  1 sibling, 1 reply; 161+ messages in thread
From: Sean Christopherson @ 2021-02-08 17:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Kirill A. Shutemov, Andy Lutomirski,
	Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, LKML

On Sun, Feb 07, 2021, Andy Lutomirski wrote:
> 
> > On Feb 7, 2021, at 2:31 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> > 
> > On 2/7/21 12:29 PM, Kirill A. Shutemov wrote:
> >>> Couldn't you just have one big helper that takes *all* the registers
> >>> that get used in any TDVMCALL and sets all the rcx bits?  The users
> >>> could just pass 0's for the things they don't use.

IIRC, having exactly one helper is a big mess (my original code used a single
main helper).  CPUID has a large number of outputs, and because outputs are
handled as pointers, the assembly routine needs to check output params for NULL.

And if we want to wire up port I/O directly to TDVMCALL to avoid the #VE, IN
and OUT need separate helpers to implement a non-standard register ABI in order
to play nice with ALTERNATIVES.

This also has my vote, mainly because gcc doesn't allow directly specifying
r8-r15 as register constraints to inline asm.  That creates a nasty hole where
a register can get corrupted if code is inserted between writing the local
variable and passing it to the inline asm.

Case in point, patch 08 has this exact bug.

> +static u64 tdx_read_msr_safe(unsigned int msr, int *err)
> +{
> +       register long r10 asm("r10") = TDVMCALL_STANDARD;
> +       register long r11 asm("r11") = EXIT_REASON_MSR_READ;
> +       register long r12 asm("r12") = msr;
> +       register long rcx asm("rcx");
> +       long ret;
> +
> +       WARN_ON_ONCE(tdx_is_context_switched_msr(msr));

This can corrupt r10, r11 and/or r12, e.g. if tdx_is_context_switched_msr() is
not inlined, it can use r10 and r11 as scratch registers, and gcc isn't smart
enough to know it needs to save registers before the call.

Even if the code as committed is guaranteed to work, IMO this approach is
hostile toward future debuggers/developers, e.g. adding a printk() in here to
debug can introduce a completely different failure.

> +
> +       if (msr == MSR_CSTAR)
> +               return 0;
> +
> +       /* Allow to pass R10, R11 and R12 down to the VMM */
> +       rcx = BIT(10) | BIT(11) | BIT(12);
> +
> +       asm volatile(TDCALL
> +                       : "=a"(ret), "=r"(r10), "=r"(r11), "=r"(r12)
> +                       : "a"(TDVMCALL), "r"(rcx), "r"(r10), "r"(r11), "r"(r12)
> +                       : );
> +
> +       /* XXX: Better error handling needed? */
> +       *err = (ret || r10) ? -EIO : 0;
> +
> +       return r11;
> +}

> >>> Then you've got the ugly inline asm in one place.  It also makes it
> >>> harder to screw up the 'rcx' mask and end up passing registers you
> >>> didn't want into a malicious VMM.
> >> For now we only pass down R10-R15, but the interface allows to pass down
> >> much wider set of registers, including XMM. How far do we want to get it?
> > 
> > Just do what we immediately need: R10-R15
> > .
> > 
> 
> How much of the register state is revealed to the VMM when we do a TDVMCALL?
> Presumably we should fully sanitize all register state that shows up in
> cleartext on the other end, and we should treat all regs that can be modified
> by the VMM as clobbered.

The guest gets to choose, with a few restrictions.  RSP cannot be exposed to the
host.  RAX, RCX, R10, and R11 are always exposed as they hold mandatory info
about the TDVMCALL (TDCALL fn, GPR mask, GHCI vs. vendor, and TDVMCALL fn).  All
other GPRs are exposed and clobbered if their bit in RCX is set, otherwise they
are saved/restored by the TDX-Module.

I agree with Dave, pass everything required by the GHCI in the main routine, and
sanitize and save/restore all such GPRs.


* Re: [RFC v1 09/26] x86/tdx: Handle CPUID via #VE
  2021-02-08 17:10               ` Sean Christopherson
@ 2021-02-08 17:35                 ` Andy Lutomirski
  2021-02-08 17:47                   ` Sean Christopherson
  0 siblings, 1 reply; 161+ messages in thread
From: Andy Lutomirski @ 2021-02-08 17:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Kirill A. Shutemov, Andy Lutomirski,
	Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, LKML

On Mon, Feb 8, 2021 at 9:11 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Sun, Feb 07, 2021, Andy Lutomirski wrote:
> >

> > How much of the register state is revealed to the VMM when we do a TDVMCALL?
> > Presumably we should fully sanitize all register state that shows up in
> > cleartext on the other end, and we should treat all regs that can be modified
> > by the VMM as clobbered.
>
> The guest gets to choose, with a few restrictions.  RSP cannot be exposed to the
> host.  RAX, RCX, R10, and R11 are always exposed as they hold mandatory info
> about the TDVMCALL (TDCALL fn, GPR mask, GHCI vs. vendor, and TDVMCALL fn).  All
> other GPRs are exposed and clobbered if their bit in RCX is set, otherwise they
> are saved/restored by the TDX-Module.
>
> I agree with Dave, pass everything required by the GHCI in the main routine, and
> sanitize and save/restore all such GPRs.

Sounds okay to me.


* Re: [RFC v1 09/26] x86/tdx: Handle CPUID via #VE
  2021-02-08 17:35                 ` Andy Lutomirski
@ 2021-02-08 17:47                   ` Sean Christopherson
  0 siblings, 0 replies; 161+ messages in thread
From: Sean Christopherson @ 2021-02-08 17:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Kirill A. Shutemov, Andy Lutomirski,
	Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, LKML

On Mon, Feb 08, 2021, Andy Lutomirski wrote:
> On Mon, Feb 8, 2021 at 9:11 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Sun, Feb 07, 2021, Andy Lutomirski wrote:
> > >
> 
> > > How much of the register state is revealed to the VMM when we do a TDVMCALL?
> > > Presumably we should fully sanitize all register state that shows up in
> > > cleartext on the other end, and we should treat all regs that can be modified
> > > by the VMM as clobbered.
> >
> > The guest gets to choose, with a few restrictions.  RSP cannot be exposed to the
> > host.  RAX, RCX, R10, and R11 are always exposed as they hold mandatory info
> > about the TDVMCALL (TDCALL fn, GPR mask, GHCI vs. vendor, and TDVMCALL fn).  All
> > other GPRs are exposed and clobbered if their bit in RCX is set, otherwise they
> > are saved/restored by the TDX-Module.
> >
> > I agree with Dave, pass everything required by the GHCI in the main routine, and
> > sanitize and save/restore all such GPRs.
> 
> Sounds okay to me.

One clarification: only non-volatile GPRs (R12-R15) need to be saved/restored.

And I think it makes sense to sanitize any exposed GPRs (that don't hold an
output value) after TDVMCALL to avoid speculating with a host-controlled value.


* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-08 16:59           ` Peter Zijlstra
@ 2021-02-08 19:05             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-02-08 19:05 UTC (permalink / raw)
  To: Peter Zijlstra, Sean Christopherson
  Cc: Andi Kleen, Andy Lutomirski, Dave Hansen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel, Sean Christopherson



On 2/8/21 8:59 AM, Peter Zijlstra wrote:
> 'cute', might be useful to have that mentioned somewhere.
We will add a note for it in the comments.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v1 04/26] x86/tdx: Get TD execution environment information via TDINFO
  2021-02-08 10:00   ` Peter Zijlstra
@ 2021-02-08 19:10     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-02-08 19:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Dave Hansen, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel



On 2/8/21 2:00 AM, Peter Zijlstra wrote:
> This needs a binutils version number.
Yes, we will add it in the next version.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-05 23:38 ` [RFC v1 05/26] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
  2021-02-08 10:20   ` Peter Zijlstra
@ 2021-02-12 19:20   ` Dave Hansen
  2021-02-12 19:47   ` Andy Lutomirski
  2 siblings, 0 replies; 161+ messages in thread
From: Dave Hansen @ 2021-02-12 19:20 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> More details on cases where #VE exceptions are allowed/not-allowed:
> 
> The #VE exception do not occur in the paranoid entry paths, like NMIs.
> While other operations during an NMI might cause #VE, these are in the
> NMI code that can handle nesting, so there is no concern about
> reentrancy. This is similar to how #PF is handled in NMIs.
> 
> The #VE exception also cannot happen in entry/exit code with the
> wrong gs, such as the SWAPGS code, so it's entry point does not
> need "paranoid" handling.

Considering:

https://lore.kernel.org/lkml/20200825171903.GA20660@sjchrist-ice/

I would suggest revisiting this part of the changelog.


* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-05 23:38 ` [RFC v1 05/26] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
  2021-02-08 10:20   ` Peter Zijlstra
  2021-02-12 19:20   ` Dave Hansen
@ 2021-02-12 19:47   ` Andy Lutomirski
  2021-02-12 20:06     ` Sean Christopherson
  2 siblings, 1 reply; 161+ messages in thread
From: Andy Lutomirski @ 2021-02-12 19:47 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, LKML, Sean Christopherson

On Fri, Feb 5, 2021 at 3:39 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> The TDX module injects #VE exception to the guest TD in cases of
> disallowed instructions, disallowed MSR accesses and subset of CPUID
> leaves. Also, it's theoretically possible for CPU to inject #VE
> exception on EPT violation, but the TDX module makes sure this does
> not happen, as long as all memory used is properly accepted using
> TDCALLs.

By my very cursory reading of the TDX arch specification 9.8.2,
"Secure" EPT violations don't send #VE.  But the docs are quite
unclear, or at least the docs I found are.  What happens if the guest
attempts to access a secure GPA that is not ACCEPTed?  For example,
suppose the VMM does TDH.MEM.PAGE.REMOVE on a secure address and the
guest accesses it, via instruction fetch or data access.  What
happens?


* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-12 19:47   ` Andy Lutomirski
@ 2021-02-12 20:06     ` Sean Christopherson
  2021-02-12 20:17       ` Dave Hansen
  2021-02-12 20:20       ` Andy Lutomirski
  0 siblings, 2 replies; 161+ messages in thread
From: Sean Christopherson @ 2021-02-12 20:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Dave Hansen,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, LKML, Sean Christopherson

On Fri, Feb 12, 2021, Andy Lutomirski wrote:
> On Fri, Feb 5, 2021 at 3:39 PM Kuppuswamy Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> >
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > The TDX module injects #VE exception to the guest TD in cases of
> > disallowed instructions, disallowed MSR accesses and subset of CPUID
> > leaves. Also, it's theoretically possible for CPU to inject #VE
> > exception on EPT violation, but the TDX module makes sure this does
> > not happen, as long as all memory used is properly accepted using
> > TDCALLs.
> 
> By my very cursory reading of the TDX arch specification 9.8.2,
> "Secure" EPT violations don't send #VE.  But the docs are quite
> unclear, or at least the docs I found are.

The version I have also states that SUPPRESS_VE is always set.  So either there
was a change in direction, or the public docs need to be updated.  Lazy accept
requires a #VE, either from hardware or from the module.  The latter would
require walking the Secure EPT tables on every EPT violation...

> What happens if the guest attempts to access a secure GPA that is not
> ACCEPTed?  For example, suppose the VMM does THH.MEM.PAGE.REMOVE on a secure
> address and the guest accesses it, via instruction fetch or data access.
> What happens?

Well, as currently written in the spec, it will generate an EPT violation and
the host will have no choice but to kill the guest.


* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-12 20:06     ` Sean Christopherson
@ 2021-02-12 20:17       ` Dave Hansen
  2021-02-12 20:37         ` Sean Christopherson
  2021-02-12 20:20       ` Andy Lutomirski
  1 sibling, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-02-12 20:17 UTC (permalink / raw)
  To: Sean Christopherson, Andy Lutomirski
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, LKML, Sean Christopherson

On 2/12/21 12:06 PM, Sean Christopherson wrote:
>> What happens if the guest attempts to access a secure GPA that is not
>> ACCEPTed?  For example, suppose the VMM does THH.MEM.PAGE.REMOVE on a secure
>> address and the guest accesses it, via instruction fetch or data access.
>> What happens?
> Well, as currently written in the spec, it will generate an EPT violation and
> the host will have no choice but to kill the guest.

That's actually perfect behavior from my perspective.  Host does
something stupid.  Host gets left holding the pieces.  No enabling to do
in the guest.

This doesn't *preclude* the possibility that the VMM and guest could
establish a protocol to remove guest pages.  It just means that the host
can't go it alone and that if the guest and host get out of sync, the
guest dies.

In other words, I think I'm rooting for the docs, as written. :)


* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-12 20:06     ` Sean Christopherson
  2021-02-12 20:17       ` Dave Hansen
@ 2021-02-12 20:20       ` Andy Lutomirski
  2021-02-12 20:44         ` Sean Christopherson
  1 sibling, 1 reply; 161+ messages in thread
From: Andy Lutomirski @ 2021-02-12 20:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Dave Hansen, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok, LKML,
	Sean Christopherson


> On Feb 12, 2021, at 12:06 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Fri, Feb 12, 2021, Andy Lutomirski wrote:
>>> On Fri, Feb 5, 2021 at 3:39 PM Kuppuswamy Sathyanarayanan
>>> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>>> 
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>> 
>>> The TDX module injects #VE exception to the guest TD in cases of
>>> disallowed instructions, disallowed MSR accesses and subset of CPUID
>>> leaves. Also, it's theoretically possible for CPU to inject #VE
>>> exception on EPT violation, but the TDX module makes sure this does
>>> not happen, as long as all memory used is properly accepted using
>>> TDCALLs.
>> 
>> By my very cursory reading of the TDX arch specification 9.8.2,
>> "Secure" EPT violations don't send #VE.  But the docs are quite
>> unclear, or at least the docs I found are.
> 
> The version I have also states that SUPPRESS_VE is always set.  So either there
> was a change in direction, or the public docs need to be updated.  Lazy accept
> requires a #VE, either from hardware or from the module.  The latter would
> require walking the Secure EPT tables on every EPT violation...
> 
>> What happens if the guest attempts to access a secure GPA that is not
>> ACCEPTed?  For example, suppose the VMM does THH.MEM.PAGE.REMOVE on a secure
>> address and the guest accesses it, via instruction fetch or data access.
>> What happens?
> 
> Well, as currently written in the spec, it will generate an EPT violation and
> the host will have no choice but to kill the guest.

Or page the page back in and try again?

In regular virt guests, if the host pages out a guest page, it’s the host’s job to put it back when needed. In paravirt, a well-designed async page fault protocol can sometimes let the guest do useful work when this happens. If a guest (or bare metal) has its memory hot-removed (via balloon or whatever) and the kernel messes up and accesses removed memory, the guest (or bare metal) is toast.

I don’t see why TDX needs to be any different.


* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-12 20:17       ` Dave Hansen
@ 2021-02-12 20:37         ` Sean Christopherson
  2021-02-12 20:46           ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Sean Christopherson @ 2021-02-12 20:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, LKML, Sean Christopherson

On Fri, Feb 12, 2021, Dave Hansen wrote:
> On 2/12/21 12:06 PM, Sean Christopherson wrote:
> >> What happens if the guest attempts to access a secure GPA that is not
> >> ACCEPTed?  For example, suppose the VMM does THH.MEM.PAGE.REMOVE on a secure
> >> address and the guest accesses it, via instruction fetch or data access.
> >> What happens?
> > Well, as currently written in the spec, it will generate an EPT violation and
> > the host will have no choice but to kill the guest.
> 
> That's actually perfect behavior from my perspective.  Host does
> something stupid.  Host gets left holding the pieces.  No enabling to do
> in the guest.
> 
> This doesn't *preclude* the possibility that the VMM and guest could
> establish a protocol to remove guest pages.  It just means that the host
> can't go it alone and that if they guest and host get out of sync, the
> guest dies.
> 
> In other words, I think I'm rooting for the docs, as written. :)

I tentatively agree that the host should not be able to remove pages without
guest approval, but that's not the only use case for #VE on EPT violations.
It's not even really an intended use case.

There needs to be a mechanism for lazy/deferred/on-demand acceptance of pages.
E.g. pre-accepting every page in a VM with hundreds of GB of memory will be
ridiculously slow.

#VE is the best option to do that:

  - Relatively sane re-entrancy semantics.
  - Hardware accelerated.
  - Doesn't require stealing an IRQ from the guest.


* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-12 20:20       ` Andy Lutomirski
@ 2021-02-12 20:44         ` Sean Christopherson
  0 siblings, 0 replies; 161+ messages in thread
From: Sean Christopherson @ 2021-02-12 20:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Dave Hansen, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok, LKML,
	Sean Christopherson

On Fri, Feb 12, 2021, Andy Lutomirski wrote:
> 
> > On Feb 12, 2021, at 12:06 PM, Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Fri, Feb 12, 2021, Andy Lutomirski wrote:
> >>> On Fri, Feb 5, 2021 at 3:39 PM Kuppuswamy Sathyanarayanan
> >>> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> >>> 
> >>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >>> 
> >>> The TDX module injects a #VE exception into the guest TD in cases of
> >>> disallowed instructions, disallowed MSR accesses and a subset of CPUID
> >>> leaves. Also, it's theoretically possible for the CPU to inject a #VE
> >>> exception on EPT violation, but the TDX module makes sure this does
> >>> not happen, as long as all memory used is properly accepted using
> >>> TDCALLs.
> >> 
> >> By my very cursory reading of the TDX arch specification 9.8.2,
> >> "Secure" EPT violations don't send #VE.  But the docs are quite
> >> unclear, or at least the docs I found are.
> > 
> > The version I have also states that SUPPRESS_VE is always set.  So either there
> > was a change in direction, or the public docs need to be updated.  Lazy accept
> > requires a #VE, either from hardware or from the module.  The latter would
> > require walking the Secure EPT tables on every EPT violation...
> > 
> >> What happens if the guest attempts to access a secure GPA that is not
> >> ACCEPTed?  For example, suppose the VMM does TDH.MEM.PAGE.REMOVE on a secure
> >> address and the guest accesses it, via instruction fetch or data access.
> >> What happens?
> > 
> > Well, as currently written in the spec, it will generate an EPT violation and
> > the host will have no choice but to kill the guest.
> 
> Or page the page back in and try again?

The intended use isn't for swapping a page or migrating a page.  Those flows
have dedicated APIs, and do not _remove_ a page.

E.g. the KVM RFC patches already support zapping Secure EPT entries if NUMA
balancing kicks in.  But, in TDX terminology, that is a BLOCK/UNBLOCK operation.

Removal is for converting a private page to a shared page, and for paravirt
memory ballooning.

> In regular virt guests, if the host pages out a guest page, it’s the host’s
> job to put it back when needed. In paravirt, a well designed async
> protocol can sometimes let the guest do useful work when this happens. If a
> guest (or bare metal) has its memory hot removed (via balloon or whatever)
> and the kernel messes up and accesses removed memory, the guest (or bare
> metal) is toast.
> 
> I don’t see why TDX needs to be any different.

The REMOVE API isn't intended for swap.  In fact, it can't be used for swap. If
a page is removed, its contents are lost.  Because the original contents are
lost, the guest is required to re-accept the page so that the host can't
silently get the guest to consume a zero page that the guest thinks has valid
data.

For swap, the contents are preserved, and so explicit re-acceptance is not
required.  From the guest's perspective, it's really just a high-latency memory
access.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-12 20:37         ` Sean Christopherson
@ 2021-02-12 20:46           ` Dave Hansen
  2021-02-12 20:54             ` Sean Christopherson
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-02-12 20:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, LKML, Sean Christopherson

On 2/12/21 12:37 PM, Sean Christopherson wrote:
> There needs to be a mechanism for lazy/deferred/on-demand acceptance of pages.
> E.g. pre-accepting every page in a VM with hundreds of GB of memory will be
> ridiculously slow.
> 
> #VE is the best option to do that:
> 
>   - Relatively sane re-entrancy semantics.
>   - Hardware accelerated.
>   - Doesn't require stealing an IRQ from the guest.

TDX already provides a basic environment for the guest when it starts
up.  The guest has some known, good memory.  The guest also has a very,
very clear understanding of which physical pages it uses and when.  It's
staged, of course, as decompression happens and the guest comes up.

But, the guest still knows which guest physical pages it accesses and
when.  It doesn't need on-demand faulting in of non-accepted pages.  It
can simply decline to expose non-accepted pages to the wider system
before they've been accepted.

It would be nuts to merrily free non-accepted pages into the page
allocator and handle the #VE fallout as they're touched from
god-knows-where.

I don't see *ANY* case for #VE to occur inside the guest kernel, outside
of *VERY* narrow places like copy_from_user().  Period.  #VE from ring-0
is not OK.

So, no, #VE is not the best option.  No #VE's in the first place is the
best option.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-12 20:46           ` Dave Hansen
@ 2021-02-12 20:54             ` Sean Christopherson
  2021-02-12 21:06               ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Sean Christopherson @ 2021-02-12 20:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, LKML, Sean Christopherson

On Fri, Feb 12, 2021, Dave Hansen wrote:
> On 2/12/21 12:37 PM, Sean Christopherson wrote:
> > There needs to be a mechanism for lazy/deferred/on-demand acceptance of pages.
> > E.g. pre-accepting every page in a VM with hundreds of GB of memory will be
> > ridiculously slow.
> > 
> > #VE is the best option to do that:
> > 
> >   - Relatively sane re-entrancy semantics.
> >   - Hardware accelerated.
> >   - Doesn't require stealing an IRQ from the guest.
> 
> TDX already provides a basic environment for the guest when it starts
> up.  The guest has some known, good memory.  The guest also has a very,
> very clear understanding of which physical pages it uses and when.  It's
> staged, of course, as decompression happens and the guest comes up.
> 
> But, the guest still knows which guest physical pages it accesses and
> when.  It doesn't need on-demand faulting in of non-accepted pages.  It
> can simply decline to expose non-accepted pages to the wider system
> before they've been accepted.
> 
> It would be nuts to merrily free non-accepted pages into the page
> allocator and handle the #VE fallout as they're touched from
> god-knows-where.
> 
> I don't see *ANY* case for #VE to occur inside the guest kernel, outside
> of *VERY* narrow places like copy_from_user().  Period.  #VE from ring-0
> is not OK.
> 
> So, no, #VE is not the best option.  No #VE's in the first place is the
> best option.

Ah, I see what you're thinking.

Treating an EPT #VE as fatal was also considered as an option.  IIUC it was
thought that finding every nook and cranny that could access a page, without
forcing the kernel to pre-accept huge swaths of memory, would be very difficult.
It'd be wonderful if that's not the case.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-12 20:54             ` Sean Christopherson
@ 2021-02-12 21:06               ` Dave Hansen
  2021-02-12 21:37                 ` Sean Christopherson
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-02-12 21:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, LKML, Sean Christopherson

On 2/12/21 12:54 PM, Sean Christopherson wrote:
> Ah, I see what you're thinking.
> 
> Treating an EPT #VE as fatal was also considered as an option.  IIUC it was
> thought that finding every nook and cranny that could access a page, without
> forcing the kernel to pre-accept huge swaths of memory, would be very difficult.
> It'd be wonderful if that's not the case.

We have to manually set up the page table entries for every physical
page of memory (except for the hard-coded early stuff below 8MB or
whatever).  We *KNOW*, 100% before physical memory is accessed.

There aren't nooks and crannies where memory is accessed.  There are a
few, very well-defined choke points which must be crossed before memory
is accessed.  Page table creation, bootmem and the core page allocator
come to mind.

If Linux doesn't have a really good handle on which physical pages are
accessed when, we've got bigger problems on our hands.  Remember, we
even have debugging mechanisms that unmap pages from the kernel when
they're in the allocator.  We know so well that nobody is accessing
those physical addresses that we even tell hypervisors they can toss the
page contents and remove the physical backing (guest free page hinting).

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-12 21:06               ` Dave Hansen
@ 2021-02-12 21:37                 ` Sean Christopherson
  2021-02-12 21:47                   ` Andy Lutomirski
  0 siblings, 1 reply; 161+ messages in thread
From: Sean Christopherson @ 2021-02-12 21:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, LKML, Sean Christopherson

On Fri, Feb 12, 2021, Dave Hansen wrote:
> On 2/12/21 12:54 PM, Sean Christopherson wrote:
> > Ah, I see what you're thinking.
> > 
> > Treating an EPT #VE as fatal was also considered as an option.  IIUC it was
> > thought that finding every nook and cranny that could access a page, without
> > forcing the kernel to pre-accept huge swaths of memory, would be very difficult.
> > It'd be wonderful if that's not the case.
> 
> We have to manually set up the page table entries for every physical
> page of memory (except for the hard-coded early stuff below 8MB or
> whatever).  We *KNOW*, 100% before physical memory is accessed.
> 
> There aren't nooks and crannies where memory is accessed.  There are a
> few, very well-defined choke points which must be crossed before memory
> is accessed.  Page table creation, bootmem and the core page allocator
> come to mind.

Heh, for me, that's two places too many beyond my knowledge domain to feel
comfortable putting a stake in the ground saying #VE isn't necessary.

Joking aside, I agree that treating EPT #VEs as fatal would be ideal, but from a
TDX architecture perspective, when considering all possible kernels, drivers,
configurations, etc..., it's risky to say that there will _never_ be a scenario
that "requires" #VE.

What about adding a property to the TD, e.g. via a flag set during TD creation,
that controls whether unaccepted accesses cause #VE or are, for all intents and
purposes, fatal?  That would allow Linux to pursue treating EPT #VEs for private
GPAs as fatal, but would give us a safety net and not prevent others from utilizing
#VEs.

I suspect it would also be helpful for debug, e.g. if the kernel manages to do
something stupid and maps memory it hasn't accepted, in which case debugging a
#VE in the guest is likely easier than an opaque EPT violation in the host.

> If Linux doesn't have a really good handle on which physical pages are
> accessed when, we've got bigger problems on our hands.  Remember, we
> even have debugging mechanisms that unmap pages from the kernel when
> they're in the allocator.  We know so well that nobody is accessing
> those physical addresses that we even tell hypervisors they can toss the
> page contents and remove the physical backing (guest free page hinting).

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-12 21:37                 ` Sean Christopherson
@ 2021-02-12 21:47                   ` Andy Lutomirski
  2021-02-12 21:48                     ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Andy Lutomirski @ 2021-02-12 21:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Andy Lutomirski, Kuppuswamy Sathyanarayanan,
	Peter Zijlstra, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok, LKML,
	Sean Christopherson

On Fri, Feb 12, 2021 at 1:37 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Feb 12, 2021, Dave Hansen wrote:
> > On 2/12/21 12:54 PM, Sean Christopherson wrote:
> > > Ah, I see what you're thinking.
> > >
> > > Treating an EPT #VE as fatal was also considered as an option.  IIUC it was
> > > thought that finding every nook and cranny that could access a page, without
> > > forcing the kernel to pre-accept huge swaths of memory, would be very difficult.
> > > It'd be wonderful if that's not the case.
> >
> > We have to manually set up the page table entries for every physical
> > page of memory (except for the hard-coded early stuff below 8MB or
> > whatever).  We *KNOW*, 100% before physical memory is accessed.
> >
> > There aren't nooks and crannies where memory is accessed.  There are a
> > few, very well-defined choke points which must be crossed before memory
> > is accessed.  Page table creation, bootmem and the core page allocator
> > come to mind.
>
> Heh, for me, that's two places too many beyond my knowledge domain to feel
> comfortable putting a stake in the ground saying #VE isn't necessary.
>
> Joking aside, I agree that treating EPT #VEs as fatal would be ideal, but from a
> TDX architecture perspective, when considering all possible kernels, drivers,
> configurations, etc..., it's risky to say that there will _never_ be a scenario
> that "requires" #VE.
>
> What about adding a property to the TD, e.g. via a flag set during TD creation,
> that controls whether unaccepted accesses cause #VE or are, for all intents and
> purposes, fatal?  That would allow Linux to pursue treating EPT #VEs for private
> GPAs as fatal, but would give us a safety net and not prevent others from utilizing
> #VEs.

That seems reasonable.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-12 21:47                   ` Andy Lutomirski
@ 2021-02-12 21:48                     ` Dave Hansen
  2021-02-14 19:33                       ` Andi Kleen
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-02-12 21:48 UTC (permalink / raw)
  To: Andy Lutomirski, Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, LKML, Sean Christopherson

On 2/12/21 1:47 PM, Andy Lutomirski wrote:
>> What about adding a property to the TD, e.g. via a flag set during TD creation,
>> that controls whether unaccepted accesses cause #VE or are, for all intents and
>> purposes, fatal?  That would allow Linux to pursue treating EPT #VEs for private
>> GPAs as fatal, but would give us a safety net and not prevent others from utilizing
>> #VEs.
> That seems reasonable.

Ditto.

We first need to double check to see if the docs are right, though.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-12 21:48                     ` Dave Hansen
@ 2021-02-14 19:33                       ` Andi Kleen
  2021-02-14 19:54                         ` Andy Lutomirski
  0 siblings, 1 reply; 161+ messages in thread
From: Andi Kleen @ 2021-02-14 19:33 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Sean Christopherson, Kuppuswamy Sathyanarayanan,
	Peter Zijlstra, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, LKML, Sean Christopherson

On Fri, Feb 12, 2021 at 01:48:36PM -0800, Dave Hansen wrote:
> On 2/12/21 1:47 PM, Andy Lutomirski wrote:
> >> What about adding a property to the TD, e.g. via a flag set during TD creation,
> >> that controls whether unaccepted accesses cause #VE or are, for all intents and
> >> purposes, fatal?  That would allow Linux to pursue treating EPT #VEs for private
> >> GPAs as fatal, but would give us a safety net and not prevent others from utilizing
> >> #VEs.
> > That seems reasonable.
> 
> Ditto.
> 
> We first need to double check to see if the docs are right, though.

I confirmed with the TDX module owners that #VE can only happen for:
- unaccepted pages
- instructions like MSR access or CPUID
- specific instructions that are not in the syscall gap

Also if there are future asynchronous #VEs they would only happen
with IF=1, which would also protect the gap.

So no need to make #VE an IST.

-Andi


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
  2021-02-14 19:33                       ` Andi Kleen
@ 2021-02-14 19:54                         ` Andy Lutomirski
  0 siblings, 0 replies; 161+ messages in thread
From: Andy Lutomirski @ 2021-02-14 19:54 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Andy Lutomirski, Sean Christopherson,
	Kuppuswamy Sathyanarayanan, Peter Zijlstra, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok, LKML,
	Sean Christopherson

On Sun, Feb 14, 2021 at 11:33 AM Andi Kleen <ak@linux.intel.com> wrote:
>
> On Fri, Feb 12, 2021 at 01:48:36PM -0800, Dave Hansen wrote:
> > On 2/12/21 1:47 PM, Andy Lutomirski wrote:
> > >> What about adding a property to the TD, e.g. via a flag set during TD creation,
> > >> that controls whether unaccepted accesses cause #VE or are, for all intents and
> > >> purposes, fatal?  That would allow Linux to pursue treating EPT #VEs for private
> > >> GPAs as fatal, but would give us a safety net and not prevent others from utilizing
> > >> #VEs.
> > > That seems reasonable.
> >
> > Ditto.
> >
> > We first need to double check to see if the docs are right, though.
>
> I confirmed with the TDX module owners that #VE can only happen for:
> - unaccepted pages

Can the hypervisor cause an already-accepted secure-EPT page to
transition to the unaccepted state?  If so, NAK.  Sorry, upstream
Linux does not need yet more hacks to make it kind-of-sort-of work on
the broken x86 exception architecture, especially for a feature that
is marketed for security.

As I understand it, the entire point of the TDX modular design is to
make it possible to fix at least some amount of architectural error
without silicon revisions.  If it is indeed the case that an access to
an unaccepted secure-EPT page will cause #VE, then Intel needs to take
the following actions:

1. Update the documentation to make the behavior comprehensible to
mere mortals.  Right now, this information appears to exist in the
form of emails and is, as far as I can tell, not present in the
documentation in a way that we can understand.  Keep in mind that this
discussion includes a number of experts on the software aspects of the
x86 architecture, and the fact that none of us who don't work for
Intel can figure out, authoritatively, what the spec is trying to tell
us should be a huge red flag.

2. Fix the architecture.  Barring some unexpected discovery, some
highly compelling reason, or a design entailing a number of
compromises that will, frankly, be rather embarrassing, upstream Linux
will not advertise itself as a secure implementation of a TDX guest
with the architecture in its current state.  If you would like Linux
to print a giant message along the lines of "WARNING: The TDX
architecture is defective and, as a result, your system is vulnerable
to a compromise attack by a malicious hypervisor that uses the
TDH.MEM.PAGE.REMOVE operation.  The advertised security properties of
the Intel TDX architecture are not available.  Use TDX at your own
risk.", we could consider that.  I think it would look pretty bad.

3. Engage with the ISV community, including Linux, to meaningfully
review new Intel designs for software usability.  Meaningful review
does not mean that you send us a spec, we tell you that it's broken,
and you ship it anyway.  Meaningful review also means that the
questions that the software people ask you need to be answered in a
public, authoritative location, preferably the primary spec publicly
available at Intel's website.  Emails don't count for this purpose.

There is no particular shortage of CVEs of varying degrees of severity
due to nonsensical warts in the x86 architecture causing CPL 0 kernels
to malfunction and become subject to privilege escalation.  We are
telling you loud and clear that the current TDX architecture appears
to be a minefield and that it is *specifically* vulnerable to an
attack in which a page accessed early in the SYSCALL path (or late in the
SYSRET path) causes #VE. You need to take this seriously.

--Andy

^ permalink raw reply	[flat|nested] 161+ messages in thread

* [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions
  2021-02-07 22:45             ` Andy Lutomirski
  2021-02-08 17:10               ` Sean Christopherson
@ 2021-03-18 21:30               ` Kuppuswamy Sathyanarayanan
  2021-03-19 16:55                 ` Sean Christopherson
  1 sibling, 1 reply; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-03-18 21:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Implement common helper functions to communicate with
the TDX Module and the VMM (using the TDCALL instruction).

The tdvmcall() function can be used to request services
from the VMM.

The tdcall() function can be used to communicate with the
TDX Module.

Using common helper functions makes the code more readable
and less error prone than scattered, use-case-specific
inline assembly. The only downside of this approach is that
it adds a few extra instructions for every TDCALL use case
compared to open-coded calls. Although it's a bit less
efficient, it's worth it to make the code more readable.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Hi All,

As you have suggested, I have created common helper functions
for all tdcall() and tdvmcall() use cases. They use inline
assembly and pass GPRs R8-R15 and RAX/RCX/RDX to the TDX
Module/VMM. Please take a look and let me know your comments.
If you agree with the design, I can re-submit the patchset
with changes to use these new APIs.

 arch/x86/include/asm/tdx.h | 27 ++++++++++++++++++++
 arch/x86/kernel/tdx.c      | 52 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 0b9d571b1f95..311252a90cfb 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -3,8 +3,27 @@
 #ifndef _ASM_X86_TDX_H
 #define _ASM_X86_TDX_H
 
+#include <linux/types.h>
+
 #define TDX_CPUID_LEAF_ID	0x21
 
+#define TDVMCALL		0
+
+/* TDVMCALL R10 Input */
+#define TDVMCALL_STANDARD	0
+
+/*
+ * The TDCALL instruction is new in the TDX architecture. It is
+ * used by a TD to request (untrusted) services from the host VMM.
+ * The mnemonic is supported in binutils >= 2.36.
+ */
+#define TDCALL	".byte 0x66,0x0f,0x01,0xcc"
+
+struct tdcall_regs {
+	u64 rax, rcx, rdx;
+	u64 r8, r9, r10, r11, r12, r13, r14, r15;
+};
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 /* Common API to check TDX support in decompression and common kernel code. */
@@ -12,6 +31,10 @@ bool is_tdx_guest(void);
 
 void __init tdx_early_init(void);
 
+void tdcall(u64 leafid, struct tdcall_regs *regs);
+
+void tdvmcall(u64 subid, struct tdcall_regs *regs);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
@@ -21,6 +44,10 @@ static inline bool is_tdx_guest(void)
 
 static inline void tdx_early_init(void) { };
 
+static inline void tdcall(u64 leafid, struct tdcall_regs *regs) { };
+
+static inline void tdvmcall(u64 subid, struct tdcall_regs *regs) { };
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e44e55d1e519..7ae1d25e272b 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -4,6 +4,58 @@
 #include <asm/tdx.h>
 #include <asm/cpufeature.h>
 
+void tdcall(u64 leafid, struct tdcall_regs *regs)
+{
+	asm volatile(
+			/* RAX = leafid (TDCALL LEAF ID) */
+			"  movq %0, %%rax;"
+			/* Load regs->r[*] into RCX, RDX and R8-R15 */
+			"  movq 8(%1), %%rcx;"
+			"  movq 16(%1), %%rdx;"
+			"  movq 24(%1), %%r8;"
+			"  movq 32(%1), %%r9;"
+			"  movq 40(%1), %%r10;"
+			"  movq 48(%1), %%r11;"
+			"  movq 56(%1), %%r12;"
+			"  movq 64(%1), %%r13;"
+			"  movq 72(%1), %%r14;"
+			"  movq 80(%1), %%r15;"
+			TDCALL ";"
+			/* Save TDCALL success/failure to regs->rax */
+			"  movq %%rax, (%1);"
+			/* Save rcx and rdx contents to regs->r[c-d]x */
+			"  movq %%rcx, 8(%1);"
+			"  movq %%rdx, 16(%1);"
+			/* Save the contents of R8-R15 to regs->r[8-15] */
+			"  movq %%r8, 24(%1);"
+			"  movq %%r9, 32(%1);"
+			"  movq %%r10, 40(%1);"
+			"  movq %%r11, 48(%1);"
+			"  movq %%r12, 56(%1);"
+			"  movq %%r13, 64(%1);"
+			"  movq %%r14, 72(%1);"
+			"  movq %%r15, 80(%1);"
+
+		:
+		: "r" (leafid), "r" (regs)
+		: "memory", "rax", "rbx", "rcx", "rdx", "r8",
+		  "r9", "r10", "r11", "r12", "r13", "r14", "r15"
+		);
+
+}
+
+void tdvmcall(u64 subid, struct tdcall_regs *regs)
+{
+	/* Expose GPRs R8-R15 to VMM */
+	regs->rcx = 0xff00;
+	/* R10 = 0 (standard TDVMCALL) */
+	regs->r10 = TDVMCALL_STANDARD;
+	/* Save subid to r11 register */
+	regs->r11 = subid;
+
+	tdcall(TDVMCALL, regs);
+}
+
 static inline bool cpuid_has_tdx_guest(void)
 {
 	u32 eax, signature[3];
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* Re: [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions
  2021-03-18 21:30               ` [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions Kuppuswamy Sathyanarayanan
@ 2021-03-19 16:55                 ` Sean Christopherson
  2021-03-19 17:42                   ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Sean Christopherson @ 2021-03-19 16:55 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, linux-kernel

On Thu, Mar 18, 2021, Kuppuswamy Sathyanarayanan wrote:
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index e44e55d1e519..7ae1d25e272b 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -4,6 +4,58 @@
>  #include <asm/tdx.h>
>  #include <asm/cpufeature.h>
>  
> +void tdcall(u64 leafid, struct tdcall_regs *regs)
> +{
> +	asm volatile(
> +			/* RAX = leafid (TDCALL LEAF ID) */
> +			"  movq %0, %%rax;"
> +			/* Load regs->r[*] into RCX, RDX and R8-R15 */
> +			"  movq 8(%1), %%rcx;"

I am super duper opposed to using inline asm.  Large blocks are hard to read,
and even harder to maintain.  E.g. the %1 usage falls apart if an output
constraint is added; that can be avoided by defining a local const/imm (I forget
what they're called), but it doesn't help readability.

> +			"  movq 16(%1), %%rdx;"
> +			"  movq 24(%1), %%r8;"
> +			"  movq 32(%1), %%r9;"
> +			"  movq 40(%1), %%r10;"
> +			"  movq 48(%1), %%r11;"
> +			"  movq 56(%1), %%r12;"
> +			"  movq 64(%1), %%r13;"
> +			"  movq 72(%1), %%r14;"
> +			"  movq 80(%1), %%r15;"

This is extremely unsafe, and wasteful.  Putting the onus on the caller to zero
out unused registers, with no mechanism to enforce/encourage doing so, makes it
likely that the kernel will leak information to the VMM, e.g. in the form of
stack data due to a partially initialized "regs".

And although TDVMCALL is anything but speedy, requiring multiple memory
operations just to set a single register is unnecessary.  Not to mention
several of these registers are never used in the GHCI-defined TDVMCALLs.  And,
since the caller defines the mask (which I also dislike), it's possible/likely
that many of these memory operations are wasteful even for registers that are
used by _some_ TDVMCALLs.  Unnecessary accesses are inevitable if we want a
common helper, but this is too much.

> +			TDCALL ";"
> +			/* Save TDCALL success/failure to regs->rax */
> +			"  movq %%rax, (%1);"
> +			/* Save rcx and rdx contents to regs->r[c-d]x */
> +			"  movq %%rcx, 8(%1);"
> +			"  movq %%rdx, 16(%1);"
> +			/* Save the contents of R8-R15 to regs->r[8-15] */
> +			"  movq %%r8, 24(%1);"
> +			"  movq %%r9, 32(%1);"
> +			"  movq %%r10, 40(%1);"
> +			"  movq %%r11, 48(%1);"
> +			"  movq %%r12, 56(%1);"
> +			"  movq %%r13, 64(%1);"
> +			"  movq %%r14, 72(%1);"
> +			"  movq %%r15, 80(%1);"
> +
> +		:
> +		: "r" (leafid), "r" (regs)
> +		: "memory", "rax", "rbx", "rcx", "rdx", "r8",
> +		  "r9", "r10", "r11", "r12", "r13", "r14", "r15"

All these clobbers mean even more memory operations...

> +		);
> +
> +}
> +
> +void tdvmcall(u64 subid, struct tdcall_regs *regs)
> +{
> +	/* Expose GPRs R8-R15 to VMM */
> +	regs->rcx = 0xff00;
> +	/* R10 = 0 (standard TDVMCALL) */
> +	regs->r10 = TDVMCALL_STANDARD;
> +	/* Save subid to r11 register */
> +	regs->r11 = subid;
> +
> +	tdcall(TDVMCALL, regs);

This implies the caller is responsible for _all_ error checking.  The base
TDCALL should never fail; if it does, something is horribly wrong with TDX-Module
and panicking is absolutely the best option.

The users of this are going to be difficult to read as well, since the
parameters are stuffed into a struct instead of being passed to a function.

IMO, throwing the bulk of the code in a proper asm subroutine and handling only
the GHCI-defined TDVMCALLs is the way to go.  If/when a VMM comes along that
wants to enlighten Linux guests to work with non-GCHI TDVMCALLs, enhancing this
madness can be their problem.

Completely untested...

struct tdvmcall_output {
	u64 r12;
	u64 r13;
	u64 r14;
	u64 r15;
}

u64 __tdvmcall(u64 fn, u64 p0, u64 p1, u64 p2, u64 p3,
	       struct tdvmcall_output *out);

	/* Offset for fields in tdvmcall_output */
	OFFSET(TDVMCALL_r12, tdvmcall_output, r13);
	OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
	OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
	OFFSET(TDVMCALL_r15, tdvmcall_output, r15);

SYM_FUNC_START(__tdvmcall)
	FRAME_BEGIN

	/* Save/restore non-volatile GPRs that are exposed to the VMM. */
	push %r15
	push %r14
	push %r13
	push %r12

	/*
	 * 0    => RAX = TDCALL leaf
	 * 0    => R10 = standard vs. vendor
	 * RDI  => R11 = TDVMCALL function, e.g. exit reason
	 * RSI  => R12 = input param 0
	 * RDX  => R13 = input param 1
	 * RCX  => R14 = input param 2
	 * R8   => R15 = input param 3
	 * MASK => RCX = TDVMCALL register behavior
	 * R9   => N/A = output struct
	 */
	xor %eax, %eax
	xor %r10d, %r10d
	mov %rdi, %r11
	mov %rsi, %r12
	mov %rdx, %r13
	mov %rcx, %r14
	mov %r8,  %r15

	/*
	 * Expose R10 - R15, i.e. all GPRs that may be used by TDVMCALLs
	 * defined in the GHCI.  Note, RAX and RCX are consumed, but only by
	 * TDX-Module and so don't need to be listed in the mask.
	 */
	movl $0xfc00, %ecx

	tdcall

	/* Panic if TDCALL reports failure. */
	test %rax, %rax
	jnz 2f

	/* Propagate TDVMCALL success/failure to return value. */
	mov %r10, %rax

	/*
	 * On success, propagate TDVMCALL output values to the output struct,
	 * if an output struct is provided.
	 */
	test %rax, %rax
	jnz 1f
	test %r9, %r9
	jz 1f

	movq %r11, TDVMCALL_r11(%r9)
	movq %r12, TDVMCALL_r12(%r9)
	movq %r13, TDVMCALL_r13(%r9)
	movq %r14, TDVMCALL_r14(%r9)
	movq %r15, TDVMCALL_r15(%r9)
1:
	/*
	 * Zero out registers exposed to the VMM to avoid speculative execution
	 * with VMM-controlled values.
	 */
	xor %r10d, %r10d
	xor %r11d, %r11d
	xor %r12d, %r12d
	xor %r13d, %r13d
	xor %r14d, %r14d
	xor %r15d, %r15d

	pop %r12
	pop %r13
	pop %r14
	pop %r15

	FRAME_END
	ret
2:
	ud2
SYM_FUNC_END(__tdvmcall)

/*
 * Wrapper for the semi-common case where errors are fatal and there is a
 * single output value.
 */
static inline u64 tdvmcall(u64 fn, u64 p0, u64 p1, u64 p2, u64 p3)
{
	struct tdvmcall_output out;
	u64 err;

	err = __tdvmcall(fn, p0, p1, p2, p3, &out);
	BUG_ON(err);

	return out.r11;
}

static void tdx_handle_cpuid(struct pt_regs *regs)
{
	struct tdvmcall_output out;
	u64 err;

	err = __tdvmcall(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, &out);
	BUG_ON(err);

	regs->ax = out.r11;
	regs->bx = out.r12;
	regs->cx = out.r13;
	regs->dx = out.r14;
}

#define REG_MASK(size) ((1ULL << ((size) * 8)) - 1)

static void tdx_handle_io(struct pt_regs *regs, u32 exit_qual)
{
	u8 out = (exit_qual & 8) ? 0 : 1;
	u8 size = (exit_qual & 7) + 1;
	u16 port = exit_qual >> 16;
	u64 val;

	/* I/O string ops are unrolled at build time. */
	BUG_ON(exit_qual & 0x10);

	if (!tdx_allowed_port(port))
		return;

	if (out)
		val = regs->ax & REG_MASK(size);
	else
		val = 0;

	val = tdvmcall(EXIT_REASON_IO_INSTRUCTION, port, size, out, val);
	if (!out) {
		/* The upper bits of *AX are preserved for 2 and 1 byte I/O. */
		if (size < 4)
			val |= (regs->ax & ~REG_MASK(size));
		regs->ax = val;
	}
}

static u64 tdx_read_msr_safe(unsigned int msr, int *ret)
{
	struct tdvmcall_output out;
	u64 err;

	WARN_ON_ONCE(tdx_is_context_switched_msr(msr));

	if (msr == MSR_CSTAR) {
		*ret = 0;
		return 0;
	}

	err = __tdvmcall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out);
	if (err) {
		*ret = -EIO;
		return 0;
	}
	*ret = 0;
	return out.r11;
}


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions
  2021-03-19 16:55                 ` Sean Christopherson
@ 2021-03-19 17:42                   ` Kuppuswamy, Sathyanarayanan
  2021-03-19 18:22                     ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-03-19 17:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, linux-kernel

Hi Sean,

Thanks for the review.

On 3/19/21 9:55 AM, Sean Christopherson wrote:
> On Thu, Mar 18, 2021, Kuppuswamy Sathyanarayanan wrote:
>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>> index e44e55d1e519..7ae1d25e272b 100644
>> --- a/arch/x86/kernel/tdx.c
>> +++ b/arch/x86/kernel/tdx.c
>> @@ -4,6 +4,58 @@
>>   #include <asm/tdx.h>
>>   #include <asm/cpufeature.h>
>>   
>> +void tdcall(u64 leafid, struct tdcall_regs *regs)
>> +{
>> +	asm volatile(
>> +			/* RAX = leafid (TDCALL LEAF ID) */
>> +			"  movq %0, %%rax;"
>> +			/* Move regs->r[*] data to regs r[a-c]x, r8-r15 */
>> +			"  movq 8(%1), %%rcx;"
> 
> I am super duper opposed to using inline asm.  Large blocks are hard to read,
I think this point is arguable. Based on the review comments I have received
so far, people prefer inline assembly to asm sub-functions.
> and even harder to maintain.  E.g. the %1 usage falls apart if an output
> constraint is added; that can be avoided by defining a local const/imm (I forget
> what they're called), but it doesn't help readability.
We can use OFFSET() macros to improve readability and avoid this issue. Also,
IMO, anyone adding constraints should know how it would affect the asm code.
> 
>> +			"  movq 16(%1), %%rdx;"
>> +			"  movq 24(%1), %%r8;"
>> +			"  movq 32(%1), %%r9;"
>> +			"  movq 40(%1), %%r10;"
>> +			"  movq 48(%1), %%r11;"
>> +			"  movq 56(%1), %%r12;"
>> +			"  movq 64(%1), %%r13;"
>> +			"  movq 72(%1), %%r14;"
>> +			"  movq 80(%1), %%r15;"
> 
> This is extremely unsafe, and wasteful.  Putting the onus on the caller to zero
> out unused registers, with no mechanism to enforce/encourage doing so,
For encouragement, we can add a comment to this function about the caller's
responsibility.
> makes it
> likely that the kernel will leak information to the VMM, e.g. in the form of
> stack data due to a partially initialized "regs".
Unless you create sub-functions for each use case, callers cannot avoid this
responsibility.
> 
> And although TDVMCALL is anything but speedy, requiring multiple memory
> operations just to set a single register is unnecessary.  Not to mention
> several of these registers are never used in the GHCI-defined TDVMCALLs. 
This function is common between TDCALL and TDVMCALL. The extra registers you
mentioned are related to other TDCALL use cases.
> And,
> since the caller defines the mask (which I also dislike), it's possible/likely
> that many of these memory operations are wasteful even for registers that are
> used by _some_ TDVMCALLs.  Unnecessary accesses are inevitable if we want a
> common helper, but this is too much.
Using a single function makes the code easier to maintain, more readable, and
less error prone. But I agree there are many unnecessary accesses for many users.
> 
>> +			TDCALL ";"
>> +			/* Save TDCALL success/failure to regs->rax */
>> +			"  movq %%rax, (%1);"
>> +			/* Save rcx and rdx contents to regs->r[c-d]x */
>> +			"  movq %%rcx, 8(%1);"
>> +			"  movq %%rdx, 16(%1);"
>> +			/* Move content of registers R8-R15 regs->r[8-15] */
>> +			"  movq %%r8, 24(%1);"
>> +			"  movq %%r9, 32(%1);"
>> +			"  movq %%r10, 40(%1);"
>> +			"  movq %%r11, 48(%1);"
>> +			"  movq %%r12, 56(%1);"
>> +			"  movq %%r13, 64(%1);"
>> +			"  movq %%r14, 72(%1);"
>> +			"  movq %%r15, 80(%1);"
>> +
>> +		:
>> +		: "r" (leafid), "r" (regs)
>> +		: "memory", "rax", "rbx", "rcx", "rdx", "r8",
>> +		  "r9", "r10", "r11", "r12", "r13", "r14", "r15"
> 
> All these clobbers mean even more memory operations...
> 
>> +		);
>> +
>> +}
>> +
>> +void tdvmcall(u64 subid, struct tdcall_regs *regs)
>> +{
>> +	/* Expose GPRs R8-R15 to VMM */
>> +	regs->rcx = 0xff00;
>> +	/* R10 = 0 (standard TDVMCALL) */
>> +	regs->r10 = TDVMCALL_STANDARD;
>> +	/* Save subid to r11 register */
>> +	regs->r11 = subid;
>> +
>> +	tdcall(TDVMCALL, regs);
> 
> This implies the caller is responsible for _all_ error checking.  The base
> TDCALL should never fail; if it does, something is horribly wrong with TDX-Module
> and panicking is absolutely the best option.
I haven't added error checking to the common function because some use cases,
like MSR and IO access, do not need to panic on TDVMCALL failures.

To improve this, maybe we can create sub-functions (similar to your code) like:
1. tdvmcall()  // with BUG_ON(regs.rax)
2. _tdvmcall() // without error checks
> 
> The users of this are going to be difficult to read as well since the parameters
> are stuff into a struct instead of being passed to a function.
I think the regs.rx = xx code format is easier to read than passing parameters
to the function.
> 
> IMO, throwing the bulk of the code in a proper asm subroutine and handling only
> the GHCI-defined TDVMCALLs is the way to go.  If/when a VMM comes along that
> wants to enlighten Linux guests to work with non-GHCI TDVMCALLs, enhancing this
> madness can be their problem.
> 
> Completely untested...
> 
> struct tdvmcall_output {
> 	u64 r12;
> 	u64 r13;
> 	u64 r14;
> 	u64 r15;
> };
> 
> u64 __tdvmcall(u64 fn, u64 p0, u64 p1, u64 p2, u64 p3,
> 	       struct tdvmcall_output *out);
This function is only for tdvmcall. If you want to create a
common function for all tdcall use cases, you would end
up using a struct for input as well.
> 
> 	/* Offset for fields in tdvmcall_output */
> 	OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
> 	OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
> 	OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
> 	OFFSET(TDVMCALL_r15, tdvmcall_output, r15);
> 
> SYM_FUNC_START(__tdvmcall)
> 	FRAME_BEGIN
> 
> 	/* Save/restore non-volatile GPRs that are exposed to the VMM. */
>          push %r15
>          push %r14
>          push %r13
>          push %r12
> 
> 	/*
> 	 * 0    => RAX = TDCALL leaf
> 	 * 0    => R10 = standard vs. vendor
> 	 * RDI  => R11 = TDVMCALL function, e.g. exit reason
> 	 * RSI  => R12 = input param 0
> 	 * RDX  => R13 = input param 1
> 	 * RCX  => R14 = input param 2
> 	 * R8   => R15 = input param 3
> 	 * MASK => RCX = TDVMCALL register behavior
> 	 * R9   => N/A = output struct
> 	 */
> 	xor %eax, %eax
>          xor %r10d, %r10d
> 	mov %rdi, %r11
> 	mov %rsi, %r12
> 	mov %rdx, %r13
> 	mov %rcx, %r14
> 	mov %r8,  %r15
> 
>          /*
> 	 * Expose R10 - R15, i.e. all GPRs that may be used by TDVMCALLs
> 	 * defined in the GHCI.  Note, RAX and RCX are consumed, but only by
> 	 * TDX-Module and so don't need to be listed in the mask.
> 	 */
>          movl $0xfc00, %ecx
> 
> 	tdcall
> 
> 	/* Panic if TDCALL reports failure. */
> 	test %rax, %rax
> 	jnz 2f
> 
> 	/* Propagate TDVMCALL success/failure to return value. */
> 	mov %r10, %rax
> 
> 	/*
> 	 * On success, propagate TDVMCALL outputs values to the output struct,
> 	 * if an output struct is provided.
> 	 */
> 	test %rax, %rax
> 	jnz 1f
> 	test %r9, %r9
> 	jz 1f
> 
> 	movq %r12, TDVMCALL_r12(%r9)
> 	movq %r13, TDVMCALL_r13(%r9)
> 	movq %r14, TDVMCALL_r14(%r9)
> 	movq %r15, TDVMCALL_r15(%r9)
> 1:
> 	/*
> 	 * Zero out registers exposed to the VMM to avoid speculative execution
> 	 * with VMM-controlled values.
> 	 */
>          xor %r10d, %r10d
>          xor %r11d, %r11d
>          xor %r12d, %r12d
>          xor %r13d, %r13d
>          xor %r14d, %r14d
>          xor %r15d, %r15d
> 
> 	pop %r12
>          pop %r13
>          pop %r14
>          pop %r15
> 
> 	FRAME_END
> 	ret
> 2:
> 	ud2
> SYM_FUNC_END(__tdvmcall)
> 
> /*
>   * Wrapper for the semi-common case where errors are fatal and there is a
>   * single output value.
>   */
> static inline u64 tdvmcall(u64 fn, u64 p0, u64 p1, u64 p2, u64 p3)
> {
> 	struct tdvmcall_output out;
> 	u64 err;
> 
> 	err = __tdvmcall(fn, p0, p1, p2, p3, &out);
> 	BUG_ON(err);
> 
> 	return out.r11;
> }
> 
> static void tdx_handle_cpuid(struct pt_regs *regs)
> {
> 	struct tdvmcall_output out;
> 	u64 err;
> 
> 	err = __tdvmcall(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, &out);
> 	BUG_ON(err);
> 
>          regs->ax = out.r11;
>          regs->bx = out.r12;
>          regs->cx = out.r13;
>          regs->dx = out.r14;
> }
> 
> #define REG_MASK(size) ((1ULL << ((size) * 8)) - 1)
> 
> static void tdx_handle_io(struct pt_regs *regs, u32 exit_qual)
> {
>          u8 out = (exit_qual & 8) ? 0 : 1;
>          u8 size = (exit_qual & 7) + 1;
>          u16 port = exit_qual >> 16;
>          u64 val;
> 
>          /* I/O strings ops are unrolled at build time. */
>          BUG_ON(exit_qual & 0x10);
> 
> 	if (!tdx_allowed_port(port))
> 		return;
> 
>          if (out)
>                  val = regs->ax & REG_MASK(size);
>          else
>                  val = 0;
> 
>          val = tdvmcall(EXIT_REASON_IO_INSTRUCTION, port, size, out, val);
>          if (!out) {
>                  /* The upper bits of *AX are preserved for 2 and 1 byte I/O. */
>                  if (size < 4)
>                          val |= (regs->ax & ~REG_MASK(size));
>                  regs->ax = val;
>          }
> }
> 
> static u64 tdx_read_msr_safe(unsigned int msr, int *ret)
> {
> 	struct tdvmcall_output out;
> 	u64 err;
> 
> 	WARN_ON_ONCE(tdx_is_context_switched_msr(msr));
> 
> 	if (msr == MSR_CSTAR) {
> 		*ret = 0;
> 		return 0;
> 	}
> 
> 	err = __tdvmcall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out);
> 	if (err) {
> 		*ret = -EIO;
> 		return 0;
> 	}
> 	return out.r11;
> }
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions
  2021-03-19 17:42                   ` Kuppuswamy, Sathyanarayanan
@ 2021-03-19 18:22                     ` Dave Hansen
  2021-03-19 19:58                       ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-03-19 18:22 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Sean Christopherson
  Cc: Peter Zijlstra, Andy Lutomirski, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On 3/19/21 10:42 AM, Kuppuswamy, Sathyanarayanan wrote:
>>> @@ -4,6 +4,58 @@
>>>   #include <asm/tdx.h>
>>>   #include <asm/cpufeature.h>
>>>   +void tdcall(u64 leafid, struct tdcall_regs *regs)
>>> +{
>>> +    asm volatile(
>>> +            /* RAX = leafid (TDCALL LEAF ID) */
>>> +            "  movq %0, %%rax;"
>>> +            /* Move regs->r[*] data to regs r[a-c]x, r8-r15 */
>>> +            "  movq 8(%1), %%rcx;"
>> 
>> I am super duper opposed to using inline asm.  Large blocks are
>> hard to read,
> I think this point is arguable. Based on the review comments I
> received so far, people prefer inline assembly compared to asm sub
> functions.

It's arguable, but Sean makes a pretty compelling case.

I actually think inline assembly is a monstrosity.  It's insanely arcane
and, as I hope you have noted, does not scale nicely beyond one or two
instructions.

>> and even harder to maintain.  E.g. the %1 usage falls apart if an
>> output constraint is added; that can be avoided by defining a local
>> const/imm (I forget what they're called), but it doesn't help
>> readability.
> we can use OFFSET() calls to improve the readability and avoid this 
> issue. Also IMO, any one adding constraints should know how this
> would affect the asm code.

This is about *maintainability*.  How _easily_ can someone change this
code in the future?  Sean's arguing that it's *hard* to correctly add a
constraint.  Unfortunately, our supply of omnipotent kernel developers
is a bit short.

>>> +            "  movq 16(%1), %%rdx;"
>>> +            "  movq 24(%1), %%r8;"
>>> +            "  movq 32(%1), %%r9;"
>>> +            "  movq 40(%1), %%r10;"
>>> +            "  movq 48(%1), %%r11;"
>>> +            "  movq 56(%1), %%r12;"
>>> +            "  movq 64(%1), %%r13;"
>>> +            "  movq 72(%1), %%r14;"
>>> +            "  movq 80(%1), %%r15;"
>> 
>> This is extremely unsafe, and wasteful.  Putting the onus on the 
>> caller to zero out unused registers, with no mechanism to
>> enforce/encourage doing so,
> For encouragement, we can add a comment to this function about
> callers' responsibility.
>> makes it
>> likely that the kernel will leak information to the VMM, e.g. in
>> the form of stack data due to a partially initialized "regs".
> Unless you create sub-functions for each use cases, callers cannot
> avoid this responsibility.

I don't think we're quite at the point where we throw up our hands.

It would be pretty simple to have an initializer that zeros the
registers out, or looks at the argument mask and does it more precisely.
 Surely we can do *something*.

>>     /* Offset for fields in tdvmcall_output */
>>     OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
>>     OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
>>     OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
>>     OFFSET(TDVMCALL_r15, tdvmcall_output, r15);
>>
>> SYM_FUNC_START(__tdvmcall)
>>     FRAME_BEGIN
>>
>>     /* Save/restore non-volatile GPRs that are exposed to the VMM. */
>>          push %r15
>>          push %r14
>>          push %r13
>>          push %r12

I might have some tweaks for the assembly once someone puts a real patch
together.  But, that looks a lot more sane than the inline assembly to me.


* Re: [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions
  2021-03-19 18:22                     ` Dave Hansen
@ 2021-03-19 19:58                       ` Kuppuswamy, Sathyanarayanan
  2021-03-26 23:38                         ` [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-03-19 19:58 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson
  Cc: Peter Zijlstra, Andy Lutomirski, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel



On 3/19/21 11:22 AM, Dave Hansen wrote:
> On 3/19/21 10:42 AM, Kuppuswamy, Sathyanarayanan wrote:
>>>> @@ -4,6 +4,58 @@
>>>>    #include <asm/tdx.h>
>>>>    #include <asm/cpufeature.h>
>>>>    +void tdcall(u64 leafid, struct tdcall_regs *regs)
>>>> +{
>>>> +    asm volatile(
>>>> +            /* RAX = leafid (TDCALL LEAF ID) */
>>>> +            "  movq %0, %%rax;"
>>>> +            /* Move regs->r[*] data to regs r[a-c]x, r8-r15 */
>>>> +            "  movq 8(%1), %%rcx;"
>>>
>>> I am super duper opposed to using inline asm.  Large blocks are
>>> hard to read,
>> I think this point is arguable. Based on the review comments I
>> received so far, people prefer inline assembly compared to asm sub
>> functions.
> 
> It's arguable, but Sean makes a pretty compelling case.
> 
> I actually think inline assembly is a monstrosity.  It's insanely arcane
> and, as I hope you have noted, does not scale nicely beyond one or two
> instructions.
> 
>>> and even harder to maintain.  E.g. the %1 usage falls apart if an
>>> output constraint is added; that can be avoided by defining a local
>>> const/imm (I forget what they're called), but it doesn't help
>>> readability.
>> we can use OFFSET() calls to improve the readability and avoid this
>> issue. Also IMO, any one adding constraints should know how this
>> would affect the asm code.
> 
> This is about *maintainability*.  How _easily_ can someone change this
> code in the future?  Sean's arguing that it's *hard* to correctly add a
> constraint.  Unfortunately, our supply of omnipotent kernel developers
> is a bit short.
> 
>>>> +            "  movq 16(%1), %%rdx;"
>>>> +            "  movq 24(%1), %%r8;"
>>>> +            "  movq 32(%1), %%r9;"
>>>> +            "  movq 40(%1), %%r10;"
>>>> +            "  movq 48(%1), %%r11;"
>>>> +            "  movq 56(%1), %%r12;"
>>>> +            "  movq 64(%1), %%r13;"
>>>> +            "  movq 72(%1), %%r14;"
>>>> +            "  movq 80(%1), %%r15;"
>>>
>>> This is extremely unsafe, and wasteful.  Putting the onus on the
>>> caller to zero out unused registers, with no mechanism to
>>> enforce/encourage doing so,
>> For encouragement, we can add a comment to this function about
>> callers' responsibility.
>>> makes it
>>> likely that the kernel will leak information to the VMM, e.g. in
>>> the form of stack data due to a partially initialized "regs".
>> Unless you create sub-functions for each use cases, callers cannot
>> avoid this responsibility.
> 
> I don't think we're quite at the point where we throw up our hands.
> 
> It would be pretty simple to have an initializer that zeros the
> registers out, or looks at the argument mask and does it more precisely.
>   Surely we can do *something*.
> 
>>>      /* Offset for fields in tdvmcall_output */
>>>      OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
>>>      OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
>>>      OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
>>>      OFFSET(TDVMCALL_r15, tdvmcall_output, r15);
>>>
>>> SYM_FUNC_START(__tdvmcall)
>>>      FRAME_BEGIN
>>>
>>>      /* Save/restore non-volatile GPRs that are exposed to the VMM. */
>>>           push %r15
>>>           push %r14
>>>           push %r13
>>>           push %r12
> 
> I might have some tweaks for the assembly once someone puts a real patch
> together.  But, that looks a lot more sane than the inline assembly to me.
Ok. Let me use Sean's proposal and submit a tested version of this patch.

Also, any thoughts on whether you want to use a single function for
tdcall and tdvmcall?
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-03-19 19:58                       ` Kuppuswamy, Sathyanarayanan
@ 2021-03-26 23:38                         ` Kuppuswamy Sathyanarayanan
  2021-04-20 17:36                           ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-03-26 23:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Implement common helper functions to communicate with the
TDX Module and the VMM (using the TDCALL instruction).

The __tdvmcall() function can be used to request services
from the VMM.

The __tdcall() function can be used to communicate with the
TDX Module.

Using common helper functions makes the code more readable and
less error prone compared to distributed, use-case-specific
inline assembly code. The only downside of this approach is
that it adds a few extra instructions for every TDCALL use
case compared to distributed checks. Although it is a bit less
efficient, it is worth it to make the code more readable.

Originally-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Hi All,

Please let me know your review comments. If you agree with this patch
and want to see these APIs used in the rest of the patches, I will
re-send the patch series with updated code.

Changes since v1:
 * Implemented tdvmcall and tdcall helper functions as assembly code.
 * Followed suggestions provided by Sean and Dave.

 arch/x86/include/asm/tdx.h    |  23 +++++
 arch/x86/kernel/Makefile      |   2 +-
 arch/x86/kernel/asm-offsets.c |  22 +++++
 arch/x86/kernel/tdcall.S      | 163 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c         |  30 +++++++
 5 files changed, 239 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..ce6212ce5f45 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,35 @@
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
+#include <linux/types.h>
+
+struct tdcall_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+struct tdvmcall_output {
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
 void __init tdx_early_init(void);
 
+u64 __tdcall(u64 fn, u64 rcx, u64 rdx, struct tdcall_output *out);
+
+u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+	       struct tdvmcall_output *out);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..72de0b49467e 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
 #include <xen/interface/xen.h>
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
 #ifdef CONFIG_X86_32
 # include "asm-offsets_32.c"
 #else
@@ -75,6 +79,24 @@ static void __used common(void)
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+	BLANK();
+	/* Offset for fields in tdcall_output */
+	OFFSET(TDCALL_rcx, tdcall_output, rcx);
+	OFFSET(TDCALL_rdx, tdcall_output, rdx);
+	OFFSET(TDCALL_r8, tdcall_output, r8);
+	OFFSET(TDCALL_r9, tdcall_output, r9);
+	OFFSET(TDCALL_r10, tdcall_output, r10);
+	OFFSET(TDCALL_r11, tdcall_output, r11);
+
+	/* Offset for fields in tdvmcall_output */
+	OFFSET(TDVMCALL_r11, tdvmcall_output, r11);
+	OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
+	OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
+	OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
+	OFFSET(TDVMCALL_r15, tdvmcall_output, r15);
+#endif
+
 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
 	OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..a73b67c0b407
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,163 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+
+#define TDVMCALL_EXPOSE_REGS_MASK	0xfc00
+
+/*
+ * The TDCALL instruction is newly added in the TDX architecture
+ * and is used by the TD to request (untrusted) services from the
+ * host VMM. It is supported in binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/* Only for non TDVMCALL use cases */
+SYM_FUNC_START(__tdcall)
+	FRAME_BEGIN
+
+	/* Save/restore non-volatile GPRs that are exposed to the VMM. */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/*
+	 * RDI  => RAX = TDCALL leaf
+	 * RSI  => RCX = input param 1
+	 * RDX  => RDX = input param 2
+	 * RCX  => N/A = output struct
+	 */
+
+	/* Save output pointer to R12 */
+	mov %rcx, %r12
+	/* Move TDCALL Leaf ID to RAX */
+	mov %rdi, %rax
+	/* Move input param 1 to RCX */
+	mov %rsi, %rcx
+
+	tdcall
+
+	/*
+	 * On success, propagate TDCALL outputs values to the output struct,
+	 * if an output struct is provided.
+	 */
+	test %rax, %rax
+	jnz 1f
+	test %r12, %r12
+	jz 1f
+
+	movq %rcx, TDCALL_rcx(%r12)
+	movq %rdx, TDCALL_rdx(%r12)
+	movq %r8, TDCALL_r8(%r12)
+	movq %r9, TDCALL_r9(%r12)
+	movq %r10, TDCALL_r10(%r12)
+	movq %r11, TDCALL_r11(%r12)
+1:
+	/*
+	 * Zero out registers exposed to the VMM to avoid speculative execution
+	 * with VMM-controlled values.
+	 */
+	xor %rcx, %rcx
+	xor %rdx, %rdx
+	xor %r8d, %r8d
+	xor %r9d, %r9d
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	FRAME_END
+	ret
+SYM_FUNC_END(__tdcall)
+
+.macro tdvmcall_core
+	FRAME_BEGIN
+
+	/* Save/restore non-volatile GPRs that are exposed to the VMM. */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/*
+	 * 0    => RAX = TDCALL leaf
+	 * RDI  => R11 = TDVMCALL function, e.g. exit reason
+	 * RSI  => R12 = input param 0
+	 * RDX  => R13 = input param 1
+	 * RCX  => R14 = input param 2
+	 * R8   => R15 = input param 3
+	 * MASK => RCX = TDVMCALL register behavior
+	 * R9   => R9  = output struct
+	 */
+
+	xor %eax, %eax
+	mov %rdi, %r11
+	mov %rsi, %r12
+	mov %rdx, %r13
+	mov %rcx, %r14
+	mov %r8,  %r15
+
+	/*
+	 * Expose R10 - R15, i.e. all GPRs that may be used by TDVMCALLs
+	 * defined in the GHCI.  Note, RAX and RCX are consumed, but only by
+	 * TDX-Module and so don't need to be listed in the mask.
+	 */
+	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+	tdcall
+
+	/* Panic if TDCALL reports failure. */
+	test %rax, %rax
+	jnz 2f
+
+	/* Propagate TDVMCALL success/failure to return value. */
+	mov %r10, %rax
+
+	/*
+	 * On success, propagate TDVMCALL outputs values to the output struct,
+	 * if an output struct is provided.
+	 */
+	test %rax, %rax
+	jnz 1f
+	test %r9, %r9
+	jz 1f
+
+	movq %r11, TDVMCALL_r11(%r9)
+	movq %r12, TDVMCALL_r12(%r9)
+	movq %r13, TDVMCALL_r13(%r9)
+	movq %r14, TDVMCALL_r14(%r9)
+	movq %r15, TDVMCALL_r15(%r9)
+1:
+	/*
+	 * Zero out registers exposed to the VMM to avoid speculative execution
+	 * with VMM-controlled values.
+	 */
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+	xor %r12d, %r12d
+	xor %r13d, %r13d
+	xor %r14d, %r14d
+	xor %r15d, %r15d
+
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	FRAME_END
+	ret
+2:
+	ud2
+.endm
+
+SYM_FUNC_START(__tdvmcall)
+	xor %r10, %r10
+	tdvmcall_core
+SYM_FUNC_END(__tdvmcall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 0d00dd50a6ff..1147e7e765d6 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -3,6 +3,36 @@
 
 #include <asm/tdx.h>
 
+/*
+ * Wrapper for the common case with standard output value (R10).
+ */
+static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	u64 err;
+
+	err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
+
+	WARN_ON(err);
+
+	return err;
+}
+
+/*
+ * Wrapper for the semi-common case where we need single output value (R11).
+ */
+static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+
+	struct tdvmcall_output out = {0};
+	u64 err;
+
+	err = __tdvmcall(fn, r12, r13, r14, r15, &out);
+
+	WARN_ON(err);
+
+	return out.r11;
+}
+
 static inline bool cpuid_has_tdx_guest(void)
 {
 	u32 eax, signature[3];
-- 
2.25.1



* [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-02-06  1:05       ` Andy Lutomirski
@ 2021-03-27  0:18         ` Kuppuswamy Sathyanarayanan
  2021-03-27  2:40           ` Andy Lutomirski
  2021-03-30  4:56           ` [PATCH v1 " Xiaoyao Li
  0 siblings, 2 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-03-27  0:18 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

In non-root TDX guest mode, the MWAIT, MONITOR and WBINVD instructions
are not supported, so handle the #VE exceptions they trigger as no-ops.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---

Changes since previous series:
 * Suppressed MWAIT feature as per Andi's comment.
 * Added warning debug log for MWAIT #VE exception.

 arch/x86/kernel/tdx.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..fb7d22b846fc 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -308,6 +308,9 @@ void __init tdx_early_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+	/* MWAIT is not supported on the TDX platform, so suppress it */
+	setup_clear_cpu_cap(X86_FEATURE_MWAIT);
+
 	tdg_get_info();
 
 	pv_ops.irq.safe_halt = tdg_safe_halt;
@@ -362,6 +365,26 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdg_handle_mmio(regs, ve);
 		break;
+	/*
+	 * Per Guest-Host-Communication Interface (GHCI) for Intel Trust
+	 * Domain Extensions (Intel TDX) specification, sec 2.4,
+	 * some instructions that unconditionally cause #VE (such as WBINVD,
+	 * MONITOR, MWAIT) do not have corresponding TDCALL
+	 * [TDG.VP.VMCALL <Instruction>] leaves, since the TD has been designed
+	 * with no deterministic way to confirm the result of those operations
+	 * performed by the host VMM.  In those cases, the goal is for the TD
+	 * #VE handler to increment the RIP appropriately based on the VE
+	 * information provided via TDCALL.
+	 */
+	case EXIT_REASON_WBINVD:
+		pr_warn_once("WBINVD #VE Exception\n");
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+		/* Handle as nops. */
+		break;
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+	/* MWAIT is suppressed, so we are not supposed to reach here. */
+		pr_warn("MWAIT unexpected #VE Exception\n");
+		return -EFAULT;
 	default:
 		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1



* Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-27  0:18         ` [PATCH v1 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-03-27  2:40           ` Andy Lutomirski
  2021-03-27  3:40             ` Kuppuswamy, Sathyanarayanan
  2021-03-30  4:56           ` [PATCH v1 " Xiaoyao Li
  1 sibling, 1 reply; 161+ messages in thread
From: Andy Lutomirski @ 2021-03-27  2:40 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel



> On Mar 26, 2021, at 5:18 PM, Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> 
> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> are not supported. So handle #VE due to these instructions as no ops.

These should at least be WARN.

Does TDX send #UD if these instructions have the wrong CPL?  If the #VE came from user mode, we should send an appropriate signal instead.

> 
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> ---
> 
> Changes since previous series:
> * Suppressed MWAIT feature as per Andi's comment.
> * Added warning debug log for MWAIT #VE exception.
> 
> arch/x86/kernel/tdx.c | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)
> 
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index e936b2f88bf6..fb7d22b846fc 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -308,6 +308,9 @@ void __init tdx_early_init(void)
> 
>    setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> 
> +    /* MWAIT is not supported in TDX platform, so suppress it */
> +    setup_clear_cpu_cap(X86_FEATURE_MWAIT);
> +
>    tdg_get_info();
> 
>    pv_ops.irq.safe_halt = tdg_safe_halt;
> @@ -362,6 +365,26 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>    case EXIT_REASON_EPT_VIOLATION:
>        ve->instr_len = tdg_handle_mmio(regs, ve);
>        break;
> +    /*
> +     * Per Guest-Host-Communication Interface (GHCI) for Intel Trust
> +     * Domain Extensions (Intel TDX) specification, sec 2.4,
> +     * some instructions that unconditionally cause #VE (such as WBINVD,
> +     * MONITOR, MWAIT) do not have corresponding TDCALL
> +     * [TDG.VP.VMCALL <Instruction>] leaves, since the TD has been designed
> +     * with no deterministic way to confirm the result of those operations
> +     * performed by the host VMM.  In those cases, the goal is for the TD
> +     * #VE handler to increment the RIP appropriately based on the VE
> +     * information provided via TDCALL.
> +     */
> +    case EXIT_REASON_WBINVD:
> +        pr_warn_once("WBINVD #VE Exception\n");
> +    case EXIT_REASON_MONITOR_INSTRUCTION:
> +        /* Handle as nops. */
> +        break;
> +    case EXIT_REASON_MWAIT_INSTRUCTION:
> +        /* MWAIT is suppressed, not supposed to reach here. */
> +        pr_warn("MWAIT unexpected #VE Exception\n");
> +        return -EFAULT;
>    default:
>        pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
>        return -EFAULT;
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-27  2:40           ` Andy Lutomirski
@ 2021-03-27  3:40             ` Kuppuswamy, Sathyanarayanan
  2021-03-27 16:03               ` Andy Lutomirski
  0 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-03-27  3:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel



On 3/26/21 7:40 PM, Andy Lutomirski wrote:
> 
> 
>> On Mar 26, 2021, at 5:18 PM, Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>>
>> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
>> are not supported. So handle #VE due to these instructions as no ops.
> 
> These should at least be WARN.
I will change it to WARN.
> 
> Does TDX send #UD if these instructions have the wrong CPL?  
No, TDX does not trigger #UD for these instructions.
> If the #VE came from user mode, we should send an appropriate signal instead.
It will be mapped into a #GP(0) fault. This should be enough notification, right?
> 
>>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>> ---
>>
>> Changes since previous series:
>> * Suppressed MWAIT feature as per Andi's comment.
>> * Added warning debug log for MWAIT #VE exception.
>>
>> arch/x86/kernel/tdx.c | 23 +++++++++++++++++++++++
>> 1 file changed, 23 insertions(+)
>>
>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>> index e936b2f88bf6..fb7d22b846fc 100644
>> --- a/arch/x86/kernel/tdx.c
>> +++ b/arch/x86/kernel/tdx.c
>> @@ -308,6 +308,9 @@ void __init tdx_early_init(void)
>>
>>     setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>>
>> +    /* MWAIT is not supported in TDX platform, so suppress it */
>> +    setup_clear_cpu_cap(X86_FEATURE_MWAIT);
>> +
>>     tdg_get_info();
>>
>>     pv_ops.irq.safe_halt = tdg_safe_halt;
>> @@ -362,6 +365,26 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>>     case EXIT_REASON_EPT_VIOLATION:
>>         ve->instr_len = tdg_handle_mmio(regs, ve);
>>         break;
>> +    /*
>> +     * Per Guest-Host-Communication Interface (GHCI) for Intel Trust
>> +     * Domain Extensions (Intel TDX) specification, sec 2.4,
>> +     * some instructions that unconditionally cause #VE (such as WBINVD,
>> +     * MONITOR, MWAIT) do not have corresponding TDCALL
>> +     * [TDG.VP.VMCALL <Instruction>] leaves, since the TD has been designed
>> +     * with no deterministic way to confirm the result of those operations
>> +     * performed by the host VMM.  In those cases, the goal is for the TD
>> +     * #VE handler to increment the RIP appropriately based on the VE
>> +     * information provided via TDCALL.
>> +     */
>> +    case EXIT_REASON_WBINVD:
>> +        pr_warn_once("WBINVD #VE Exception\n");
>> +    case EXIT_REASON_MONITOR_INSTRUCTION:
>> +        /* Handle as nops. */
>> +        break;
>> +    case EXIT_REASON_MWAIT_INSTRUCTION:
>> +    /* MWAIT is suppressed, not supposed to reach here. */
>> +        pr_warn("MWAIT unexpected #VE Exception\n");
>> +        return -EFAULT;
>>     default:
>>         pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
>>         return -EFAULT;
>> -- 
>> 2.25.1
>>

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-27  3:40             ` Kuppuswamy, Sathyanarayanan
@ 2021-03-27 16:03               ` Andy Lutomirski
  2021-03-27 22:54                 ` [PATCH v2 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Andy Lutomirski @ 2021-03-27 16:03 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel




> On Mar 26, 2021, at 8:40 PM, Kuppuswamy, Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> 
> 
> 
> On 3/26/21 7:40 PM, Andy Lutomirski wrote:
>>>> On Mar 26, 2021, at 5:18 PM, Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>>> 
>>> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
>>> are not supported. So handle #VE due to these instructions as no ops.
>> These should at least be WARN.
> I will change it to WARN.
>> Does TDX send #UD if these instructions have the wrong CPL?  
> No, TDX does not trigger #UD for these instructions.
>> If the #VE came from user mode, we should send an appropriate signal instead.
> It will be mapped into a #GP(0) fault. This should be enough notification, right?

Yes. And I did mean #GP, not #UD.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-27 16:03               ` Andy Lutomirski
@ 2021-03-27 22:54                 ` Kuppuswamy Sathyanarayanan
  2021-03-29 17:14                   ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-03-27 22:54 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
are not supported. So handle #VE due to these instructions as no ops.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---

Changes since v1:
 * Added WARN() for MWAIT #VE exception.

Changes since previous series:
 * Suppressed MWAIT feature as per Andi's comment.
 * Added warning debug log for MWAIT #VE exception.

 arch/x86/kernel/tdx.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..3c6158a53c17 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -308,6 +308,9 @@ void __init tdx_early_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+	/* MWAIT is not supported in TDX platform, so suppress it */
+	setup_clear_cpu_cap(X86_FEATURE_MWAIT);
+
 	tdg_get_info();
 
 	pv_ops.irq.safe_halt = tdg_safe_halt;
@@ -362,6 +365,26 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdg_handle_mmio(regs, ve);
 		break;
+	/*
+	 * Per Guest-Host-Communication Interface (GHCI) for Intel Trust
+	 * Domain Extensions (Intel TDX) specification, sec 2.4,
+	 * some instructions that unconditionally cause #VE (such as WBINVD,
+	 * MONITOR, MWAIT) do not have corresponding TDCALL
+	 * [TDG.VP.VMCALL <Instruction>] leaves, since the TD has been designed
+	 * with no deterministic way to confirm the result of those operations
+	 * performed by the host VMM.  In those cases, the goal is for the TD
+	 * #VE handler to increment the RIP appropriately based on the VE
+	 * information provided via TDCALL.
+	 */
+	case EXIT_REASON_WBINVD:
+		pr_warn_once("WBINVD #VE Exception\n");
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+		/* Handle as nops. */
+		break;
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+	/* MWAIT is suppressed, not supposed to reach here. */
+		WARN(1, "MWAIT unexpected #VE Exception\n");
+		return -EFAULT;
 	default:
 		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-27 22:54                 ` [PATCH v2 " Kuppuswamy Sathyanarayanan
@ 2021-03-29 17:14                   ` Dave Hansen
  2021-03-29 21:55                     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-03-29 17:14 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 3/27/21 3:54 PM, Kuppuswamy Sathyanarayanan wrote:
> +	/*
> +	 * Per Guest-Host-Communication Interface (GHCI) for Intel Trust
> +	 * Domain Extensions (Intel TDX) specification, sec 2.4,
> +	 * some instructions that unconditionally cause #VE (such as WBINVD,
> +	 * MONITOR, MWAIT) do not have corresponding TDCALL
> +	 * [TDG.VP.VMCALL <Instruction>] leaves, since the TD has been designed
> +	 * with no deterministic way to confirm the result of those operations
> +	 * performed by the host VMM.  In those cases, the goal is for the TD
> +	 * #VE handler to increment the RIP appropriately based on the VE
> +	 * information provided via TDCALL.
> +	 */

That's an awfully big comment.  Could you pare it down, please?  Maybe
focus on the fact that we should never get here and why, rather than
talking about some silly spec?

> +	case EXIT_REASON_WBINVD:
> +		pr_warn_once("WBINVD #VE Exception\n");

I actually think WBINVD in here should oops.  We use it for some really
important things.  If it can't be executed, and we're depending on it,
the kernel is in deep, deep trouble.

I think a noop here is dangerous.

> +	case EXIT_REASON_MONITOR_INSTRUCTION:
> +		/* Handle as nops. */
> +		break;

MONITOR is a privileged instruction, right?  So we can only end up in
here if the kernel screws up and isn't reading CPUID correctly, right?

That doesn't seem to me like something we want to suppress.  This needs
a warning, at least.  I assume that having a MONITOR instruction
immediately return doesn't do any harm.

> +	case EXIT_REASON_MWAIT_INSTRUCTION:
> +	/* MWAIT is suppressed, not supposed to reach here. */
> +		WARN(1, "MWAIT unexpected #VE Exception\n");
> +		return -EFAULT;

How is MWAIT "suppressed"?

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-29 17:14                   ` Dave Hansen
@ 2021-03-29 21:55                     ` Kuppuswamy, Sathyanarayanan
  2021-03-29 22:02                       ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-03-29 21:55 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel



On 3/29/21 10:14 AM, Dave Hansen wrote:
> On 3/27/21 3:54 PM, Kuppuswamy Sathyanarayanan wrote:
>> +	/*
>> +	 * Per Guest-Host-Communication Interface (GHCI) for Intel Trust
>> +	 * Domain Extensions (Intel TDX) specification, sec 2.4,
>> +	 * some instructions that unconditionally cause #VE (such as WBINVD,
>> +	 * MONITOR, MWAIT) do not have corresponding TDCALL
>> +	 * [TDG.VP.VMCALL <Instruction>] leaves, since the TD has been designed
>> +	 * with no deterministic way to confirm the result of those operations
>> +	 * performed by the host VMM.  In those cases, the goal is for the TD
>> +	 * #VE handler to increment the RIP appropriately based on the VE
>> +	 * information provided via TDCALL.
>> +	 */

> 
> That's an awfully big comment.  Could you pare it down, please?  Maybe
> focus on the fact that we should never get here and why, rather than
> talking about some silly spec?
I will remove this and add an individual one-line comment for the WBINVD and
MONITOR instructions. Something like "Privileged instruction, can only be
executed in ring 0. So raise a BUG."
> 
>> +	case EXIT_REASON_WBINVD:
>> +		pr_warn_once("WBINVD #VE Exception\n");
> 
> I actually think WBINVD in here should oops.  We use it for some really
> important things.  If it can't be executed, and we're depending on it,
> the kernel is in deep, deep trouble.
Agree. I will call BUG().
> 
> I think a noop here is dangerous.
> 
>> +	case EXIT_REASON_MONITOR_INSTRUCTION:
>> +		/* Handle as nops. */
>> +		break;
> 
> MONITOR is a privileged instruction, right?  So we can only end up in
> here if the kernel screws up and isn't reading CPUID correctly, right?
> 
> That doesn't seem to me like something we want to suppress.  This needs
> a warning, at least.  I assume that having a MONITOR instruction
> immediately return doesn't do any harm.
Agree. Since we are not supposed to come here, I will use BUG.
> 
>> +	case EXIT_REASON_MWAIT_INSTRUCTION:
>> +	/* MWAIT is suppressed, not supposed to reach here. */
>> +		WARN(1, "MWAIT unexpected #VE Exception\n");
>> +		return -EFAULT;
> 
> How is MWAIT "suppressed"?
I am clearing the MWAIT feature flag in early init code. We should also disable
this feature in firmware.
setup_clear_cpu_cap(X86_FEATURE_MWAIT);
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-29 21:55                     ` Kuppuswamy, Sathyanarayanan
@ 2021-03-29 22:02                       ` Dave Hansen
  2021-03-29 22:09                         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-03-29 22:02 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 3/29/21 2:55 PM, Kuppuswamy, Sathyanarayanan wrote:
>>
>> MONITOR is a privileged instruction, right?  So we can only end up in
>> here if the kernel screws up and isn't reading CPUID correctly, right?
>>
>> That doesn't seem to me like something we want to suppress.  This needs
>> a warning, at least.  I assume that having a MONITOR instruction
>> immediately return doesn't do any harm.
> Agree. Since we are not supposed to come here, I will use BUG.

"This is unexpected" is a WARN()able offense.

"This is unexpected and might be corrupting data" is where we want to
use BUG().

Does broken MONITOR risk data corruption?

>>> +    case EXIT_REASON_MWAIT_INSTRUCTION:
>>> +        /* MWAIT is suppressed, not supposed to reach here. */
>>> +        WARN(1, "MWAIT unexpected #VE Exception\n");
>>> +        return -EFAULT;
>>
>> How is MWAIT "suppressed"?
> I am clearing the MWAIT feature flag in early init code. We should also
> disable this feature in firmware. setup_clear_cpu_cap(X86_FEATURE_MWAIT);

I'd be more explicit about that.  Maybe even reference the code that
clears the X86_FEATURE.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-29 22:02                       ` Dave Hansen
@ 2021-03-29 22:09                         ` Kuppuswamy, Sathyanarayanan
  2021-03-29 22:12                           ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-03-29 22:09 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel



On 3/29/21 3:02 PM, Dave Hansen wrote:
> On 3/29/21 2:55 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>
>>> MONITOR is a privileged instruction, right?  So we can only end up in
>>> here if the kernel screws up and isn't reading CPUID correctly, right?
>>>
>>> That dosen't seem to me like something we want to suppress.  This needs
>>> a warning, at least.  I assume that having a MONITOR instruction
>>> immediately return doesn't do any harm.
>> Agree. Since we are not supposed to come here, I will use BUG.
> 
> "This is unexpected" is a WARN()able offense.
> 
> "This is unexpected and might be corrupting data" is where we want to
> use BUG().
> 
> Does broken MONITOR risk data corruption?
We will be reaching this point only if something is buggy in the kernel. I am
not sure about the impact of this buggy state. But the MONITOR instruction by
itself should not cause data corruption.

> 
>>>> +    case EXIT_REASON_MWAIT_INSTRUCTION:
>>>> +        /* MWAIT is suppressed, not supposed to reach here. */
>>>> +        WARN(1, "MWAIT unexpected #VE Exception\n");
>>>> +        return -EFAULT;
>>>
>>> How is MWAIT "suppressed"?
>> I am clearing the MWAIT feature flag in early init code. We should also
>> disable this feature in firmware. setup_clear_cpu_cap(X86_FEATURE_MWAIT);
> 
> I'd be more explicit about that.  Maybe even reference the code that
> clears the X86_FEATURE.
This change is part of the same patch.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-29 22:09                         ` Kuppuswamy, Sathyanarayanan
@ 2021-03-29 22:12                           ` Dave Hansen
  2021-03-29 22:42                             ` Kuppuswamy, Sathyanarayanan
  2021-03-29 23:16                             ` [PATCH v3 " Kuppuswamy Sathyanarayanan
  0 siblings, 2 replies; 161+ messages in thread
From: Dave Hansen @ 2021-03-29 22:12 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 3/29/21 3:09 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>>> +    case EXIT_REASON_MWAIT_INSTRUCTION:
>>>>> +        /* MWAIT is suppressed, not supposed to reach here. */
>>>>> +        WARN(1, "MWAIT unexpected #VE Exception\n");
>>>>> +        return -EFAULT;
>>>>
>>>> How is MWAIT "suppressed"?
>>> I am clearing the MWAIT feature flag in early init code. We should also
>>> disable this feature in firmware.
>>> setup_clear_cpu_cap(X86_FEATURE_MWAIT);
>>
>> I'd be more explicit about that.  Maybe even reference the code that
>> clears the X86_FEATURE.
> This change is part of the same patch.

Right, but if someone goes and looks at the switch() statement in 10
years is it going to be obvious how MWAIT was "suppressed"?

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-29 22:12                           ` Dave Hansen
@ 2021-03-29 22:42                             ` Kuppuswamy, Sathyanarayanan
  2021-03-29 23:16                             ` [PATCH v3 " Kuppuswamy Sathyanarayanan
  1 sibling, 0 replies; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-03-29 22:42 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel



On 3/29/21 3:12 PM, Dave Hansen wrote:
> On 3/29/21 3:09 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>>>> +    case EXIT_REASON_MWAIT_INSTRUCTION:
>>>>>> +        /* MWAIT is suppressed, not supposed to reach here. */
>>>>>> +        WARN(1, "MWAIT unexpected #VE Exception\n");
>>>>>> +        return -EFAULT;
>>>>>
>>>>> How is MWAIT "suppressed"?
>>>> I am clearing the MWAIT feature flag in early init code. We should also
>>>> disable this feature in firmware.
>>>> setup_clear_cpu_cap(X86_FEATURE_MWAIT);
>>>
>>> I'd be more explicit about that.  Maybe even reference the code that
>>> clears the X86_FEATURE.
>> This change is part of the same patch.
> 
> Right, but if someone goes and looks at the switch() statement in 10
> years is it going to be obvious how MWAIT was "suppressed"?
Ok. I can add a comment line for it.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 161+ messages in thread

* [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-29 22:12                           ` Dave Hansen
  2021-03-29 22:42                             ` Kuppuswamy, Sathyanarayanan
@ 2021-03-29 23:16                             ` Kuppuswamy Sathyanarayanan
  2021-03-29 23:23                               ` Andy Lutomirski
  2021-03-29 23:38                               ` Dave Hansen
  1 sibling, 2 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-03-29 23:16 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
are not supported. So handle #VE due to these instructions
appropriately.

Since the impact of executing the WBINVD instruction in non-ring-0 mode
can be heavy, use BUG() to report it. For others, raise a WARNING
message.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---

Changes since v2:
 * Added BUG() for WBINVD, WARN for MONITOR instructions.
 * Fixed comments as per Dave's review.

Changes since v1:
 * Added WARN() for MWAIT #VE exception.

Changes since previous series:
 * Suppressed MWAIT feature as per Andi's comment.
 * Added warning debug log for MWAIT #VE exception.

 arch/x86/kernel/tdx.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..4c6336a055a3 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -308,6 +308,9 @@ void __init tdx_early_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+	/* MWAIT is not supported in TDX platform, so suppress it */
+	setup_clear_cpu_cap(X86_FEATURE_MWAIT);
+
 	tdg_get_info();
 
 	pv_ops.irq.safe_halt = tdg_safe_halt;
@@ -362,6 +365,32 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdg_handle_mmio(regs, ve);
 		break;
+	case EXIT_REASON_WBINVD:
+		/*
+		 * WBINVD is a privileged instruction, can only be executed
+	 * in ring 0. Since we reached here, the kernel is in a buggy
+		 * state.
+		 */
+		pr_err("WBINVD #VE Exception\n");
+		BUG();
+		break;
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+		/*
+		 * MONITOR is a privileged instruction, can only be executed
+		 * in ring 0. So we are not supposed to reach here. Raise a
+		 * warning message.
+		 */
+		WARN(1, "MONITOR unexpected #VE Exception\n");
+		break;
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+		/*
+		 * MWAIT feature is suppressed in firmware and in
+		 * tdx_early_init() by clearing X86_FEATURE_MWAIT CPU feature
+		 * flag. Since we are not supposed to reach here, raise a
+		 * warning message and return -EFAULT.
+		 */
+		WARN(1, "MWAIT unexpected #VE Exception\n");
+		return -EFAULT;
 	default:
 		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-29 23:16                             ` [PATCH v3 " Kuppuswamy Sathyanarayanan
@ 2021-03-29 23:23                               ` Andy Lutomirski
  2021-03-29 23:37                                 ` Kuppuswamy, Sathyanarayanan
  2021-03-29 23:39                                 ` [PATCH v3 " Sean Christopherson
  2021-03-29 23:38                               ` Dave Hansen
  1 sibling, 2 replies; 161+ messages in thread
From: Andy Lutomirski @ 2021-03-29 23:23 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel


> On Mar 29, 2021, at 4:17 PM, Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> 
> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> are not supported. So handle #VE due to these instructions
> appropriately.

Is there something I missed elsewhere in the code that checks CPL?

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-29 23:23                               ` Andy Lutomirski
@ 2021-03-29 23:37                                 ` Kuppuswamy, Sathyanarayanan
  2021-03-29 23:42                                   ` Sean Christopherson
  2021-03-29 23:39                                 ` [PATCH v3 " Sean Christopherson
  1 sibling, 1 reply; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-03-29 23:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel



On 3/29/21 4:23 PM, Andy Lutomirski wrote:
> 
>> On Mar 29, 2021, at 4:17 PM, Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>>
>> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
>> are not supported. So handle #VE due to these instructions
>> appropriately.
> 
> Is there something I missed elsewhere in the code that checks CPL?
We don't check for CPL explicitly. But if we are reaching here, then we are
executing these instructions with the wrong CPL.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-29 23:16                             ` [PATCH v3 " Kuppuswamy Sathyanarayanan
  2021-03-29 23:23                               ` Andy Lutomirski
@ 2021-03-29 23:38                               ` Dave Hansen
  1 sibling, 0 replies; 161+ messages in thread
From: Dave Hansen @ 2021-03-29 23:38 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 3/29/21 4:16 PM, Kuppuswamy Sathyanarayanan wrote:
> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> are not supported. So handle #VE due to these instructions
> appropriately.

This misses a key detail:

	"are not supported" ... and other patches have prevented a guest
	from using these instructions.

In other words, they're bad now, and we know it.  We tried to keep the
kernel from using them, but we failed.  Whoops.

> Since the impact of executing WBINVD instruction in non ring-0 mode
> can be heavy, use BUG() to report it. For others, raise a WARNING
> message.

"Heavy" makes it sound like there's a performance impact.



>  	pv_ops.irq.safe_halt = tdg_safe_halt;
> @@ -362,6 +365,32 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>  	case EXIT_REASON_EPT_VIOLATION:
>  		ve->instr_len = tdg_handle_mmio(regs, ve);
>  		break;
> +	case EXIT_REASON_WBINVD:
> +		/*
> +		 * WBINVD is a privileged instruction, can only be executed
> +		 * in ring 0. Since we reached here, the kernel is in a buggy
> +		 * state.
> +		 */
> +		pr_err("WBINVD #VE Exception\n");

"#VE Exception" is basically wasted bytes.  It really tells us nothing.
 This, on the other hand:

	"TDX guest used unsupported WBINVD instruction"

gives us the clues that TDX is involved and that the kernel used the
instruction.  The fact that #VE itself is involved is kinda irrelevant.

I also hate the comment.  Don't waste the bytes saying we're in a buggy
state.  That's kinda obvious from the BUG().

Again, it might be nice to mention in the changelog *WHY* we're so sure
that WBINVD won't be called from a TDX guest.  What did we do to
guarantee that?  How could we be sure that someone looking at the splat
that this generates and then reading the comment can go fix the bug in
their kernel?

> +		BUG();
> +		break;
> +	case EXIT_REASON_MONITOR_INSTRUCTION:
> +		/*
> +		 * MONITOR is a privileged instruction, can only be executed
> +		 * in ring 0. So we are not supposed to reach here. Raise a
> +		 * warning message.
> +		 */
> +		WARN(1, "MONITOR unexpected #VE Exception\n");
> +		break;
> +	case EXIT_REASON_MWAIT_INSTRUCTION:
> +		/*
> +		 * MWAIT feature is suppressed in firmware and in
> +		 * tdx_early_init() by clearing X86_FEATURE_MWAIT CPU feature
> +		 * flag. Since we are not supposed to reach here, raise a
> +		 * warning message and return -EFAULT.
> +		 */
> +		WARN(1, "MWAIT unexpected #VE Exception\n");
> +		return -EFAULT;
>  	default:
>  		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
>  		return -EFAULT;

This is more of the same.  Don't waste comment bytes telling me we're
not suppose to reach a BUG() or WARN().

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-29 23:23                               ` Andy Lutomirski
  2021-03-29 23:37                                 ` Kuppuswamy, Sathyanarayanan
@ 2021-03-29 23:39                                 ` Sean Christopherson
  1 sibling, 0 replies; 161+ messages in thread
From: Sean Christopherson @ 2021-03-29 23:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Mon, Mar 29, 2021, Andy Lutomirski wrote:
> 
> > On Mar 29, 2021, at 4:17 PM, Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> > 
> > In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> > are not supported. So handle #VE due to these instructions
> > appropriately.
> 
> Is there something I missed elsewhere in the code that checks CPL?

#GP due to CPL!=0 has priority over VM-Exit, i.e. userspace will get a #GP
directly; there will be no VM-Exit to the TDX Module and thus no #VE.

SDM section "25.1.1 - Relative Priority of Faults and VM Exits".
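Sean's two rules can be condensed into a tiny model. This is purely illustrative (the enum and helper are invented here, not kernel or TDX-module code), but it captures why the #VE handler never has to check CPL:

```c
#include <assert.h>
#include <string.h>

/*
 * Illustrative model of which fault a TD guest takes for these
 * instructions: at CPL != 0 the plain fault (#UD for MONITOR/MWAIT,
 * #GP for WBINVD) has priority over the VM exit, so no #VE is ever
 * injected; only CPL = 0 reaches the #VE handler.
 */
enum td_fault { TD_FAULT_UD, TD_FAULT_GP, TD_FAULT_VE };

static enum td_fault td_guest_fault(const char *insn, int cpl)
{
	if (cpl != 0)
		return strcmp(insn, "WBINVD") == 0 ? TD_FAULT_GP
						   : TD_FAULT_UD;
	return TD_FAULT_VE;	/* VM exit -> TDX module injects #VE */
}
```

In this model, a #VE for any of these instructions is proof the kernel itself (CPL = 0) executed them; userspace can never trigger one.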


* Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-29 23:37                                 ` Kuppuswamy, Sathyanarayanan
@ 2021-03-29 23:42                                   ` Sean Christopherson
  2021-03-29 23:58                                     ` Andy Lutomirski
  0 siblings, 1 reply; 161+ messages in thread
From: Sean Christopherson @ 2021-03-29 23:42 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Andy Lutomirski, Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

On Mon, Mar 29, 2021, Kuppuswamy, Sathyanarayanan wrote:
> 
> 
> On 3/29/21 4:23 PM, Andy Lutomirski wrote:
> > 
> > > On Mar 29, 2021, at 4:17 PM, Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> > > 
> > > In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> > > are not supported. So handle #VE due to these instructions
> > > appropriately.
> > 
> > Is there something I missed elsewhere in the code that checks CPL?
> We don't check for CPL explicitly. But if we reach here, then we are
> executing these instructions with the wrong CPL.

No, if these instructions take a #VE then they were executed at CPL=0.  MONITOR
and MWAIT will #UD without VM-Exit->#VE.  Same for WBINVD, s/#UD/#GP.


* Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-29 23:42                                   ` Sean Christopherson
@ 2021-03-29 23:58                                     ` Andy Lutomirski
  2021-03-30  2:04                                       ` Andi Kleen
  0 siblings, 1 reply; 161+ messages in thread
From: Andy Lutomirski @ 2021-03-29 23:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok, LKML

On Mon, Mar 29, 2021 at 4:42 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Mar 29, 2021, Kuppuswamy, Sathyanarayanan wrote:
> >
> >
> > On 3/29/21 4:23 PM, Andy Lutomirski wrote:
> > >
> > > > On Mar 29, 2021, at 4:17 PM, Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> > > >
> > > > In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> > > > are not supported. So handle #VE due to these instructions
> > > > appropriately.
> > >
> > > Is there something I missed elsewhere in the code that checks CPL?
> > We don't check for CPL explicitly. But if we reach here, then we are
> > executing these instructions with the wrong CPL.
>
> No, if these instructions take a #VE then they were executed at CPL=0.  MONITOR
> and MWAIT will #UD without VM-Exit->#VE.  Same for WBINVD, s/#UD/#GP.

Dare I ask about XSETBV?


* Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-29 23:58                                     ` Andy Lutomirski
@ 2021-03-30  2:04                                       ` Andi Kleen
  2021-03-30  2:58                                         ` Andy Lutomirski
  0 siblings, 1 reply; 161+ messages in thread
From: Andi Kleen @ 2021-03-30  2:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sean Christopherson, Kuppuswamy, Sathyanarayanan, Peter Zijlstra,
	Dave Hansen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, LKML

> > No, if these instructions take a #VE then they were executed at CPL=0.  MONITOR
> > and MWAIT will #UD without VM-Exit->#VE.  Same for WBINVD, s/#UD/#GP.
> 
> Dare I ask about XSETBV?

XGETBV does not cause a #VE, it just works normally. The guest has full
AVX capabilities.

-Andi



* Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-30  2:04                                       ` Andi Kleen
@ 2021-03-30  2:58                                         ` Andy Lutomirski
  2021-03-30 15:14                                           ` Sean Christopherson
  2021-03-31 21:09                                           ` [PATCH v4 " Kuppuswamy Sathyanarayanan
  0 siblings, 2 replies; 161+ messages in thread
From: Andy Lutomirski @ 2021-03-30  2:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, Sean Christopherson, Kuppuswamy,
	Sathyanarayanan, Peter Zijlstra, Dave Hansen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok, LKML


> On Mar 29, 2021, at 7:04 PM, Andi Kleen <ak@linux.intel.com> wrote:
> 
> 
>> 
>>> No, if these instructions take a #VE then they were executed at CPL=0.  MONITOR
>>> and MWAIT will #UD without VM-Exit->#VE.  Same for WBINVD, s/#UD/#GP.
>> 
>> Dare I ask about XSETBV?
> 
> XGETBV does not cause a #VE, it just works normally. The guest has full
> AVX capabilities.
> 

X *SET* BV


* Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-27  0:18         ` [PATCH v1 1/1] " Kuppuswamy Sathyanarayanan
  2021-03-27  2:40           ` Andy Lutomirski
@ 2021-03-30  4:56           ` Xiaoyao Li
  2021-03-30 15:00             ` Andi Kleen
  1 sibling, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2021-03-30  4:56 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 3/27/2021 8:18 AM, Kuppuswamy Sathyanarayanan wrote:
> In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> are not supported. So handle #VE due to these instructions as no ops.
> 
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> ---
> 
> Changes since previous series:
>   * Suppressed MWAIT feature as per Andi's comment.
>   * Added warning debug log for MWAIT #VE exception.
> 
>   arch/x86/kernel/tdx.c | 23 +++++++++++++++++++++++
>   1 file changed, 23 insertions(+)
> 
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index e936b2f88bf6..fb7d22b846fc 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -308,6 +308,9 @@ void __init tdx_early_init(void)
>   
>   	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>   
> +	/* MWAIT is not supported in TDX platform, so suppress it */
> +	setup_clear_cpu_cap(X86_FEATURE_MWAIT);

In fact, MWAIT bit returned by CPUID instruction is zero for TD guest. 
This is enforced by SEAM module.

Do we still need to safeguard it by setup_clear_cpu_cap() here?

> +
>   	tdg_get_info();
>   
>   	pv_ops.irq.safe_halt = tdg_safe_halt;




* Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-30  4:56           ` [PATCH v1 " Xiaoyao Li
@ 2021-03-30 15:00             ` Andi Kleen
  2021-03-30 15:10               ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Andi Kleen @ 2021-03-30 15:00 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On Tue, Mar 30, 2021 at 12:56:41PM +0800, Xiaoyao Li wrote:
> On 3/27/2021 8:18 AM, Kuppuswamy Sathyanarayanan wrote:
> > In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> > are not supported. So handle #VE due to these instructions as no ops.
> > 
> > Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> > Reviewed-by: Andi Kleen <ak@linux.intel.com>
> > ---
> > 
> > Changes since previous series:
> >   * Suppressed MWAIT feature as per Andi's comment.
> >   * Added warning debug log for MWAIT #VE exception.
> > 
> >   arch/x86/kernel/tdx.c | 23 +++++++++++++++++++++++
> >   1 file changed, 23 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> > index e936b2f88bf6..fb7d22b846fc 100644
> > --- a/arch/x86/kernel/tdx.c
> > +++ b/arch/x86/kernel/tdx.c
> > @@ -308,6 +308,9 @@ void __init tdx_early_init(void)
> >   	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> > +	/* MWAIT is not supported in TDX platform, so suppress it */
> > +	setup_clear_cpu_cap(X86_FEATURE_MWAIT);
> 
> In fact, MWAIT bit returned by CPUID instruction is zero for TD guest. This
> is enforced by SEAM module.

Good point.
> 
> Do we still need to safeguard it by setup_clear_cpu_cap() here?

I guess it doesn't hurt to do it explicitly.


-Andi


* Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-30 15:00             ` Andi Kleen
@ 2021-03-30 15:10               ` Dave Hansen
  2021-03-30 17:02                 ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-03-30 15:10 UTC (permalink / raw)
  To: Andi Kleen, Xiaoyao Li
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel

On 3/30/21 8:00 AM, Andi Kleen wrote:
>>> +	/* MWAIT is not supported in TDX platform, so suppress it */
>>> +	setup_clear_cpu_cap(X86_FEATURE_MWAIT);
>> In fact, MWAIT bit returned by CPUID instruction is zero for TD guest. This
>> is enforced by SEAM module.
> Good point.
>> Do we still need to safeguard it by setup_clear_cpu_cap() here?
> I guess it doesn't hurt to do it explicitly.

If this MWAIT behavior (clearing the CPUID bit) is part of the guest
architecture, then it would also be appropriate to WARN() rather than
silently clearing the X86_FEATURE bit.


* Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-30  2:58                                         ` Andy Lutomirski
@ 2021-03-30 15:14                                           ` Sean Christopherson
  2021-03-30 16:37                                             ` Andy Lutomirski
  2021-03-31 21:09                                           ` [PATCH v4 " Kuppuswamy Sathyanarayanan
  1 sibling, 1 reply; 161+ messages in thread
From: Sean Christopherson @ 2021-03-30 15:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Andy Lutomirski, Kuppuswamy, Sathyanarayanan,
	Peter Zijlstra, Dave Hansen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok, LKML

On Mon, Mar 29, 2021, Andy Lutomirski wrote:
> 
> > On Mar 29, 2021, at 7:04 PM, Andi Kleen <ak@linux.intel.com> wrote:
> > 
> > 
> >> 
> >>> No, if these instructions take a #VE then they were executed at CPL=0.  MONITOR
> >>> and MWAIT will #UD without VM-Exit->#VE.  Same for WBINVD, s/#UD/#GP.
> >> 
> >> Dare I ask about XSETBV?
> > 
> > XGETBV does not cause a #VE, it just works normally. The guest has full
> > AVX capabilities.
> > 
> 
> X *SET* BV

Heh, XSETBV also works normally, relative to the features enumerated in CPUID.
XSAVES/XRSTORS support is fixed to '1' in the virtual CPU model.  A subset of
the features managed by XSAVE can be hidden by the VMM, but attempting to enable
unsupported features will #GP (either from hardware or injected by TDX Module),
not #VE.
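The XSETBV behavior described here reduces to a mask check. A hypothetical sketch (the function and names are invented for illustration; real XCR0 handling lives in hardware and the TDX module):

```c
#include <assert.h>

/*
 * Illustrative model: the VMM can hide a subset of XSAVE-managed
 * features from CPUID; XSETBV of any bit outside the enumerated
 * mask takes #GP (from hardware or injected by the TDX module),
 * never #VE.
 */
enum xsetbv_outcome { XSETBV_OK, XSETBV_GP };

static enum xsetbv_outcome xsetbv_result(unsigned long long requested,
					 unsigned long long supported)
{
	return (requested & ~supported) ? XSETBV_GP : XSETBV_OK;
}
```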


* Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-30 15:14                                           ` Sean Christopherson
@ 2021-03-30 16:37                                             ` Andy Lutomirski
  2021-03-30 16:57                                               ` Sean Christopherson
  0 siblings, 1 reply; 161+ messages in thread
From: Andy Lutomirski @ 2021-03-30 16:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andi Kleen, Andy Lutomirski, Kuppuswamy, Sathyanarayanan,
	Peter Zijlstra, Dave Hansen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok, LKML


> On Mar 30, 2021, at 8:14 AM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Mon, Mar 29, 2021, Andy Lutomirski wrote:
>> 
>>>> On Mar 29, 2021, at 7:04 PM, Andi Kleen <ak@linux.intel.com> wrote:
>>> 
>>> 
>>>> 
>>>>> No, if these instructions take a #VE then they were executed at CPL=0.  MONITOR
>>>>> and MWAIT will #UD without VM-Exit->#VE.  Same for WBINVD, s/#UD/#GP.
>>>> 
>>>> Dare I ask about XSETBV?
>>> 
>>> XGETBV does not cause a #VE, it just works normally. The guest has full
>>> AVX capabilities.
>>> 
>> 
>> X *SET* BV
> 
> Heh, XSETBV also works normally, relative to the features enumerated in CPUID.
> XSAVES/XRSTORS support is fixed to '1' in the virtual CPU model.  A subset of
> the features managed by XSAVE can be hidden by the VMM, but attempting to enable
> unsupported features will #GP (either from hardware or injected by TDX Module),
> not #VE.

Normally in non-root mode means that every XSETBV results in a VM exit and, IIUC, there’s a buglet in that this happens even if CPL==3.  Does something special happen in TDX or does the exit get reflected back to the guest as a #VE?


* Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-30 16:37                                             ` Andy Lutomirski
@ 2021-03-30 16:57                                               ` Sean Christopherson
  2021-04-07 15:24                                                 ` Andi Kleen
  0 siblings, 1 reply; 161+ messages in thread
From: Sean Christopherson @ 2021-03-30 16:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Andy Lutomirski, Kuppuswamy, Sathyanarayanan,
	Peter Zijlstra, Dave Hansen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok, LKML

On Tue, Mar 30, 2021, Andy Lutomirski wrote:
> 
> > On Mar 30, 2021, at 8:14 AM, Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Mon, Mar 29, 2021, Andy Lutomirski wrote:
> >> 
> >>>> On Mar 29, 2021, at 7:04 PM, Andi Kleen <ak@linux.intel.com> wrote:
> >>> 
> >>> 
> >>>> 
> >>>>> No, if these instructions take a #VE then they were executed at CPL=0.  MONITOR
> >>>>> and MWAIT will #UD without VM-Exit->#VE.  Same for WBINVD, s/#UD/#GP.
> >>>> 
> >>>> Dare I ask about XSETBV?
> >>> 
> >>> XGETBV does not cause a #VE, it just works normally. The guest has full
> >>> AVX capabilities.
> >>> 
> >> 
> >> X *SET* BV
> > 
> > Heh, XSETBV also works normally, relative to the features enumerated in CPUID.
> > XSAVES/XRSTORS support is fixed to '1' in the virtual CPU model.  A subset of
> > the features managed by XSAVE can be hidden by the VMM, but attempting to enable
> > unsupported features will #GP (either from hardware or injected by TDX Module),
> > not #VE.
> 
> Normally in non-root mode means that every XSETBV results in a VM exit and,
> IIUC, there’s a buglet in that this happens even if CPL==3.  Does something
> special happen in TDX or does the exit get reflected back to the guest as a
> #VE?

Hmm, I forgot about that quirk.  I would expect the TDX Module to inject a #GP
for that case.  I can't find anything in the spec that confirms or denies that,
but injecting #VE would be weird and pointless.

Andi/Sathya, the TDX Module spec should be updated to state that XSETBV will
#GP at CPL!=0.  If that's not already the behavior, the module should probably
be changed...


* Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-30 15:10               ` Dave Hansen
@ 2021-03-30 17:02                 ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-03-30 17:02 UTC (permalink / raw)
  To: Dave Hansen, Andi Kleen, Xiaoyao Li
  Cc: Peter Zijlstra, Andy Lutomirski, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel



On 3/30/21 8:10 AM, Dave Hansen wrote:
> On 3/30/21 8:00 AM, Andi Kleen wrote:
>>>> +	/* MWAIT is not supported in TDX platform, so suppress it */
>>>> +	setup_clear_cpu_cap(X86_FEATURE_MWAIT);
>>> In fact, MWAIT bit returned by CPUID instruction is zero for TD guest. This
>>> is enforced by SEAM module.
>> Good point.
>>> Do we still need to safeguard it by setup_clear_cpu_cap() here?
>> I guess it doesn't hurt to do it explicitly.
> 
> If this MWAIT behavior (clearing the CPUID bit) is part of the guest
> architecture, then it would also be appropriate to WARN() rather than
> silently clearing the X86_FEATURE bit.
Makes sense. It will be useful for finding problems with the TDX Module.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-30  2:58                                         ` Andy Lutomirski
  2021-03-30 15:14                                           ` Sean Christopherson
@ 2021-03-31 21:09                                           ` Kuppuswamy Sathyanarayanan
  2021-03-31 21:49                                             ` Dave Hansen
  2021-03-31 21:53                                             ` Sean Christopherson
  1 sibling, 2 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-03-31 21:09 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

As per Guest-Host Communication Interface (GHCI) Specification
for Intel TDX, sec 2.4.1, TDX architecture does not support
MWAIT, MONITOR and WBINVD instructions. So in non-root TDX mode,
if MWAIT/MONITOR instructions are executed with CPL != 0 it will
trigger #UD, and for CPL = 0 case, virtual exception (#VE) is
triggered. WBINVD instruction behavior is also similar to
MWAIT/MONITOR, but for CPL != 0 case, it will trigger #GP instead
of #UD.

To prevent the TD guest from using these unsupported instructions,
the following measures are adopted:

1. For MWAIT/MONITOR, support for these instructions is already
disabled by the TDX module (SEAM), so their CPUID flags should be
in the disabled state. Also, just to be sure that they are
disabled, forcefully unset the X86_FEATURE_MWAIT CPU cap in the OS.

2. For the WBINVD instruction, we audit the kernel to find the code
that uses it and disable that code for TDX guests.

After the above-mentioned preventive measures, if the TD guest still
executes these instructions, print appropriate warning messages in the
#VE handler.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---

Changes since v3:
 * WARN user if SEAM does not disable MONITOR/MWAIT instruction.
 * Fix the commit log and comments to address review comments from
   from Dave & Sean.

Changes since v2:
 * Added BUG() for WBINVD, WARN for MONITOR instructions.
 * Fixed comments as per Dave's review.

Changes since v1:
 * Added WARN() for MWAIT #VE exception.

Changes since previous series:
 * Suppressed MWAIT feature as per Andi's comment.
 * Added warning debug log for MWAIT #VE exception.

 arch/x86/kernel/tdx.c | 44 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..82b411b828a5 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -63,6 +63,14 @@ static inline bool cpuid_has_tdx_guest(void)
 	return true;
 }
 
+static inline bool cpuid_has_mwait(void)
+{
+	if (cpuid_ecx(1) & (1 << (X86_FEATURE_MWAIT % 32)))
+		return true;
+
+	return false;
+}
+
 bool is_tdx_guest(void)
 {
 	return static_cpu_has(X86_FEATURE_TDX_GUEST);
@@ -301,12 +309,25 @@ static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 	return insn.length;
 }
 
+/* Initialize TDX specific CPU capabilities */
+static void __init tdx_cpu_cap_init(void)
+{
+	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+	if (cpuid_has_mwait()) {
+		WARN(1, "TDX Module failed to disable MWAIT\n");
+		/* MWAIT is not supported in TDX platform, so suppress it */
+		setup_clear_cpu_cap(X86_FEATURE_MWAIT);
+	}
+
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
 		return;
 
-	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+	tdx_cpu_cap_init();
 
 	tdg_get_info();
 
@@ -362,6 +383,27 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdg_handle_mmio(regs, ve);
 		break;
+	case EXIT_REASON_WBINVD:
+		/*
+		 * TDX architecture does not support WBINVD instruction.
+		 * Currently, usage of this instruction is prevented by
+		 * disabling the drivers which uses it. So if we still
+		 * reach here, it needs user attention.
+		 */
+		pr_err("TD Guest used unsupported WBINVD instruction\n");
+		BUG();
+		break;
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+		/*
+		 * MWAIT/MONITOR features are disabled by TDX Module (SEAM)
+		 * and also re-suppressed in kernel by clearing
+		 * X86_FEATURE_MWAIT CPU feature flag in tdx_early_init(). So
+		 * if TD guest still executes MWAIT/MONITOR instruction with
+		 * above suppression, it needs user attention.
+		 */
+		WARN(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
+		break;
 	default:
 		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1
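The cpuid_has_mwait() helper in the patch leans on the kernel's feature-word encoding: X86_FEATURE_* constants are (cpuid_word * 32 + bit), word 4 is CPUID.01H:ECX, and MWAIT is bit 3 there, so the "% 32" recovers the bit position within that ECX word. A standalone sketch of the mapping:

```c
#include <assert.h>

/* Kernel encoding: feature = word * 32 + bit; word 4 = CPUID.01H:ECX. */
#define X86_FEATURE_MWAIT (4 * 32 + 3)

/* The same test cpuid_has_mwait() performs, on a caller-supplied ECX. */
static int ecx_has_mwait(unsigned int ecx)
{
	return (ecx >> (X86_FEATURE_MWAIT % 32)) & 1;
}
```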



* Re: [RFC v1 00/26] Add TDX Guest Support
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (28 preceding siblings ...)
  2021-02-06  6:24 ` [RFC v1 00/26] Add TDX Guest Support sathyanarayanan.kuppuswamy
@ 2021-03-31 21:38 ` Kuppuswamy, Sathyanarayanan
  2021-04-02  0:02 ` Dave Hansen
  2021-04-04 15:02 ` Dave Hansen
  31 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-03-31 21:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

Hi All,

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> Hi All,
> 
> NOTE: This series is not ready for wide public review. It is being
> specifically posted so that Peter Z and other experts on the entry
> code can look for problems with the new exception handler (#VE).
> That's also why x86@ is not being spammed.

We are currently working on a fix for the issues raised in the
"Add #VE support for TDX guest" patch. While we fix that issue, I would
like to know if there are issues in the other patches in this series. So,
if possible, can you please review the other patches in the series and
let us know your comments?

If you want me to rebase the series on top of v5.12-rcX kernel and repost it,
please let me know.

> 
> Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
> hosts and some physical attacks. This series adds the bare-minimum
> support to run a TDX guest. The host-side support will be submitted
> separately. Also support for advanced TD guest features like attestation
> or debug-mode will be submitted separately. Also, at this point it is not
> secure with some known holes in drivers, and also hasn’t been fully audited
> and fuzzed yet.
> 
> TDX has a lot of similarities to SEV. It enhances confidentiality and
> of guest memory and state (like registers) and includes a new exception
> (#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
> yet), TDX limits the host's ability to effect changes in the guest
> physical address space.
> 
> In contrast to the SEV code in the kernel, TDX guest memory is integrity
> protected and isolated; the host is prevented from accessing guest
> memory (even ciphertext).
> 
> The TDX architecture also includes a new CPU mode called
> Secure-Arbitration Mode (SEAM). The software (TDX module) running in this
> mode arbitrates interactions between host and guest and implements many of
> the guarantees of the TDX architecture.
> 
> Some of the key differences between TD and regular VM is,
> 
> 1. Multi CPU bring-up is done using the ACPI MADT wake-up table.
> 2. A new #VE exception handler is added. The TDX module injects #VE exception
>     to the guest TD in cases of instructions that need to be emulated, disallowed
>     MSR accesses, subset of CPUID leaves, etc.
> 3. By default memory is marked as private, and TD will selectively share it with
>     VMM based on need.
> 4. Remote attestation is supported to enable a third party (either the owner of
>     the workload or a user of the services provided by the workload) to establish
>     that the workload is running on an Intel-TDX-enabled platform located within a
>     TD prior to providing that workload data.
> 
> You can find TDX related documents in the following link.
> 
> https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html
> 
> This RFC series has been reviewed by Dave Hansen.
> 
> Kirill A. Shutemov (16):
>    x86/paravirt: Introduce CONFIG_PARAVIRT_XL
>    x86/tdx: Get TD execution environment information via TDINFO
>    x86/traps: Add #VE support for TDX guest
>    x86/tdx: Add HLT support for TDX guest
>    x86/tdx: Wire up KVM hypercalls
>    x86/tdx: Add MSR support for TDX guest
>    x86/tdx: Handle CPUID via #VE
>    x86/io: Allow to override inX() and outX() implementation
>    x86/tdx: Handle port I/O
>    x86/tdx: Handle in-kernel MMIO
>    x86/mm: Move force_dma_unencrypted() to common code
>    x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
>    x86/tdx: Make pages shared in ioremap()
>    x86/tdx: Add helper to do MapGPA TDVMALL
>    x86/tdx: Make DMA pages shared
>    x86/kvm: Use bounce buffers for TD guest
> 
> Kuppuswamy Sathyanarayanan (6):
>    x86/cpufeatures: Add TDX Guest CPU feature
>    x86/cpufeatures: Add is_tdx_guest() interface
>    x86/tdx: Handle MWAIT, MONITOR and WBINVD
>    ACPI: tables: Add multiprocessor wake-up support
>    x86/topology: Disable CPU hotplug support for TDX platforms.
>    x86/tdx: Introduce INTEL_TDX_GUEST config option
> 
> Sean Christopherson (4):
>    x86/boot: Add a trampoline for APs booting in 64-bit mode
>    x86/boot: Avoid #VE during compressed boot for TDX platforms
>    x86/boot: Avoid unnecessary #VE during boot process
>    x86/tdx: Forcefully disable legacy PIC for TDX guests
> 
>   arch/x86/Kconfig                         |  28 +-
>   arch/x86/boot/compressed/Makefile        |   2 +
>   arch/x86/boot/compressed/head_64.S       |  10 +-
>   arch/x86/boot/compressed/misc.h          |   1 +
>   arch/x86/boot/compressed/pgtable.h       |   2 +-
>   arch/x86/boot/compressed/tdx.c           |  32 ++
>   arch/x86/boot/compressed/tdx_io.S        |   9 +
>   arch/x86/include/asm/apic.h              |   3 +
>   arch/x86/include/asm/asm-prototypes.h    |   1 +
>   arch/x86/include/asm/cpufeatures.h       |   1 +
>   arch/x86/include/asm/idtentry.h          |   4 +
>   arch/x86/include/asm/io.h                |  25 +-
>   arch/x86/include/asm/irqflags.h          |  42 +-
>   arch/x86/include/asm/kvm_para.h          |  21 +
>   arch/x86/include/asm/paravirt.h          |  22 +-
>   arch/x86/include/asm/paravirt_types.h    |   3 +-
>   arch/x86/include/asm/pgtable.h           |   3 +
>   arch/x86/include/asm/realmode.h          |   1 +
>   arch/x86/include/asm/tdx.h               | 114 +++++
>   arch/x86/kernel/Makefile                 |   1 +
>   arch/x86/kernel/acpi/boot.c              |  56 +++
>   arch/x86/kernel/apic/probe_32.c          |   8 +
>   arch/x86/kernel/apic/probe_64.c          |   8 +
>   arch/x86/kernel/head64.c                 |   3 +
>   arch/x86/kernel/head_64.S                |  13 +-
>   arch/x86/kernel/idt.c                    |   6 +
>   arch/x86/kernel/paravirt.c               |   4 +-
>   arch/x86/kernel/pci-swiotlb.c            |   2 +-
>   arch/x86/kernel/smpboot.c                |   5 +
>   arch/x86/kernel/tdx-kvm.c                | 116 +++++
>   arch/x86/kernel/tdx.c                    | 560 +++++++++++++++++++++++
>   arch/x86/kernel/tdx_io.S                 | 143 ++++++
>   arch/x86/kernel/topology.c               |   3 +-
>   arch/x86/kernel/traps.c                  |  73 ++-
>   arch/x86/mm/Makefile                     |   2 +
>   arch/x86/mm/ioremap.c                    |   8 +-
>   arch/x86/mm/mem_encrypt.c                |  74 ---
>   arch/x86/mm/mem_encrypt_common.c         |  83 ++++
>   arch/x86/mm/mem_encrypt_identity.c       |   1 +
>   arch/x86/mm/pat/set_memory.c             |  23 +-
>   arch/x86/realmode/rm/header.S            |   1 +
>   arch/x86/realmode/rm/trampoline_64.S     |  49 +-
>   arch/x86/realmode/rm/trampoline_common.S |   5 +-
>   drivers/acpi/tables.c                    |   9 +
>   include/acpi/actbl2.h                    |  21 +-
>   45 files changed, 1444 insertions(+), 157 deletions(-)
>   create mode 100644 arch/x86/boot/compressed/tdx.c
>   create mode 100644 arch/x86/boot/compressed/tdx_io.S
>   create mode 100644 arch/x86/include/asm/tdx.h
>   create mode 100644 arch/x86/kernel/tdx-kvm.c
>   create mode 100644 arch/x86/kernel/tdx.c
>   create mode 100644 arch/x86/kernel/tdx_io.S
>   create mode 100644 arch/x86/mm/mem_encrypt_common.c
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-31 21:09                                           ` [PATCH v4 " Kuppuswamy Sathyanarayanan
@ 2021-03-31 21:49                                             ` Dave Hansen
  2021-03-31 22:29                                               ` Kuppuswamy, Sathyanarayanan
  2021-03-31 21:53                                             ` Sean Christopherson
  1 sibling, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-03-31 21:49 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 3/31/21 2:09 PM, Kuppuswamy Sathyanarayanan wrote:
> As per Guest-Host Communication Interface (GHCI) Specification
> for Intel TDX, sec 2.4.1, TDX architecture does not support
> MWAIT, MONITOR and WBINVD instructions. So in non-root TDX mode,
> if MWAIT/MONITOR instructions are executed with CPL != 0 it will
> trigger #UD, and for CPL = 0 case, virtual exception (#VE) is
> triggered. WBINVD instruction behavior is also similar to
> MWAIT/MONITOR, but for CPL != 0 case, it will trigger #GP instead
> of #UD.

Could we have a go at saying this in plain English before jumping in and
quoting the exact spec section?  Also, the CPL language is nice and
precise for talking inside Intel, but it's generally easier for me to
read kernel descriptions when we just talk about the kernel.

	When running as a TDX guest, there are a number of existing,
	privileged instructions that do not work.  If the guest kernel
	uses these instructions, the hardware generates a #VE.

Which reminds me...  The SDM says: MWAIT will "#UD ... If
CPUID.01H:ECX.MONITOR[bit 3] = 0".  So, is this an architectural change?
 The guest is *supposed* to see that CPUID bit as 0, so shouldn't it
also get a #UD?  Or is this all so that if SEAM *forgets* to clear the
CPUID bit, the guest gets #VE?

What are we *actually* mitigating here?

Also, FWIW, MWAIT/MONITOR and WBINVD are pretty different beasts.  I
think this would all have been a lot clearer if it had been two
patches instead of shoehorning them into one.

> To prevent TD guest from using these unsupported instructions,
> following measures are adapted:
> 
> 1. For MWAIT/MONITOR instructions, support for these instructions
> are already disabled by TDX module (SEAM). So CPUID flags for
> these instructions should be in disabled state. Also, just to be
> sure that these instructions are disabled, forcefully unset
> X86_FEATURE_MWAIT CPU cap in OS.
> 
> 2. For WBINVD instruction, we use audit to find the code that uses
> this instruction and disable them for TD.

Really?  Where are those patches?

> +static inline bool cpuid_has_mwait(void)
> +{
> +	if (cpuid_ecx(1) & (1 << (X86_FEATURE_MWAIT % 32)))
> +		return true;
> +
> +	return false;
> +}
> +
>  bool is_tdx_guest(void)
>  {
>  	return static_cpu_has(X86_FEATURE_TDX_GUEST);
> @@ -301,12 +309,25 @@ static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
>  	return insn.length;
>  }
>  
> +/* Initialize TDX specific CPU capabilities */
> +static void __init tdx_cpu_cap_init(void)
> +{
> +	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> +
> +	if (cpuid_has_mwait()) {
> +		WARN(1, "TDX Module failed to disable MWAIT\n");

WARN(1, "TDX guest enumerated support for MWAIT, disabling it").

> +		/* MWAIT is not supported in TDX platform, so suppress it */
> +		setup_clear_cpu_cap(X86_FEATURE_MWAIT);
> +	}
> +
> +}

Extra newline.

>  void __init tdx_early_init(void)
>  {
>  	if (!cpuid_has_tdx_guest())
>  		return;
>  
> -	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> +	tdx_cpu_cap_init();
>  
>  	tdg_get_info();
>  
> @@ -362,6 +383,27 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>  	case EXIT_REASON_EPT_VIOLATION:
>  		ve->instr_len = tdg_handle_mmio(regs, ve);
>  		break;
> +	case EXIT_REASON_WBINVD:
> +		/*
> +		 * TDX architecture does not support WBINVD instruction.
> +		 * Currently, usage of this instruction is prevented by
> +		 * disabling the drivers which uses it. So if we still
> +		 * reach here, it needs user attention.
> +		 */

This comment is awfully vague.  "TDX architecture..." what?  Any CPUs
supporting the TDX architecture?  TDX VMMs?  TDX guests?

Let's also not waste bytes on stating the obvious.  If it didn't need
attention we wouldn't be warning about it, eh?

So, let's halve the size of the comment and say:

		/*
		 * WBINVD is not supported inside TDX guests.  All in-
		 * kernel uses should have been disabled.
		 */

> +		pr_err("TD Guest used unsupported WBINVD instruction\n");
> +		BUG();
> +		break;
> +	case EXIT_REASON_MONITOR_INSTRUCTION:
> +	case EXIT_REASON_MWAIT_INSTRUCTION:
> +		/*
> +		 * MWAIT/MONITOR features are disabled by TDX Module (SEAM)
> +		 * and also re-suppressed in kernel by clearing
> +		 * X86_FEATURE_MWAIT CPU feature flag in tdx_early_init(). So
> +		 * if TD guest still executes MWAIT/MONITOR instruction with
> +		 * above suppression, it needs user attention.
> +		 */

Again, let's trim this down:

		/*
		 * Something in the kernel used MONITOR or MWAIT despite
		 * X86_FEATURE_MWAIT being cleared for TDX guests.
		 */

Rather than naming the function, this makes it quite greppable to find
where it could have *possibly* been cleared.

> +		WARN(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
> +		break;
>  	default:
>  		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
>  		return -EFAULT;
> 



* Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-31 21:09                                           ` [PATCH v4 " Kuppuswamy Sathyanarayanan
  2021-03-31 21:49                                             ` Dave Hansen
@ 2021-03-31 21:53                                             ` Sean Christopherson
  2021-03-31 22:00                                               ` Dave Hansen
  1 sibling, 1 reply; 161+ messages in thread
From: Sean Christopherson @ 2021-03-31 21:53 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, linux-kernel

On Wed, Mar 31, 2021, Kuppuswamy Sathyanarayanan wrote:
> Changes since v3:
>  * WARN user if SEAM does not disable MONITOR/MWAIT instruction.

Why bother?  There are a whole pile of features that are dictated by the TDX
module spec.  MONITOR/MWAIT is about as uninteresting as it gets, e.g. absolute
worst case scenario is the guest kernel crashes, whereas a lot of spec violations
would compromise the security of the guest.

> +	case EXIT_REASON_MONITOR_INSTRUCTION:
> +	case EXIT_REASON_MWAIT_INSTRUCTION:
> +		/*
> +		 * MWAIT/MONITOR features are disabled by TDX Module (SEAM)
> +		 * and also re-suppressed in kernel by clearing
> +		 * X86_FEATURE_MWAIT CPU feature flag in tdx_early_init(). So
> +		 * if TD guest still executes MWAIT/MONITOR instruction with
> +		 * above suppression, it needs user attention.
> +		 */
> +		WARN(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");

Why not just WARN_ONCE and call it good?

> +		break;
>  	default:
>  		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
>  		return -EFAULT;
> -- 
> 2.25.1
> 


* Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-31 21:53                                             ` Sean Christopherson
@ 2021-03-31 22:00                                               ` Dave Hansen
  2021-03-31 22:06                                                 ` Sean Christopherson
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-03-31 22:00 UTC (permalink / raw)
  To: Sean Christopherson, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On 3/31/21 2:53 PM, Sean Christopherson wrote:
> On Wed, Mar 31, 2021, Kuppuswamy Sathyanarayanan wrote:
>> Changes since v3:
>>  * WARN user if SEAM does not disable MONITOR/MWAIT instruction.
> Why bother?  There are a whole pile of features that are dictated by the TDX
> module spec.  MONITOR/MWAIT is about as uninteresting as it gets, e.g. absolute
> worst case scenario is the guest kernel crashes, whereas a lot of spec violations
> would compromise the security of the guest.

So, what should we do?  In the #VE handler:

	switch (exit_reason) {
	case SOMETHING_WE_HANDLE:
		blah();
		break;
		...
	default:
		pr_err("unhandled #VE, exit reason: %d\n", exit_reason);
		BUG_ON(1);
	}

?

Is this the *ONLY* one of these, or are we going to have another twenty?

If this is the only one, we might as well give a nice string error
message.  If there are twenty more, let's just dump the exit reason,
BUG() and move on with our lives.



* Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-31 22:00                                               ` Dave Hansen
@ 2021-03-31 22:06                                                 ` Sean Christopherson
  2021-03-31 22:11                                                   ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Sean Christopherson @ 2021-03-31 22:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

On Wed, Mar 31, 2021, Dave Hansen wrote:
> On 3/31/21 2:53 PM, Sean Christopherson wrote:
> > On Wed, Mar 31, 2021, Kuppuswamy Sathyanarayanan wrote:
> >> Changes since v3:
> >>  * WARN user if SEAM does not disable MONITOR/MWAIT instruction.
> > Why bother?  There are a whole pile of features that are dictated by the TDX
> > module spec.  MONITOR/MWAIT is about as uninteresting as it gets, e.g. absolute
> > worst case scenario is the guest kernel crashes, whereas a lot of spec violations
> > would compromise the security of the guest.
> 
> So, what should we do?  In the #VE handler:
> 
> 	switch (exit_reason) {
> 	case SOMETHING_WE_HANDLE:
> 		blah();
> 		break;
> 		...
> 	default:
> 		pr_err("unhandled #VE, exit reason: %d\n", exit_reason);
> 		BUG_ON(1);
> 	}
> 
> ?
> 
> Is this the *ONLY* one of these, or are we going to have another twenty?
> 
> If this is the only one, we might as well give a nice string error
> message.  If there are twenty more, let's just dump the exit reason,
> BUG() and move on with our lives.

I've no objection to a nice message in the #VE handler.  What I'm objecting to
is sanity checking the CPUID model provided by the TDX module.  If we don't
trust the TDX module to honor the spec, then there are a huge pile of things
that are far higher priority than MONITOR/MWAIT.


* Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-31 22:06                                                 ` Sean Christopherson
@ 2021-03-31 22:11                                                   ` Dave Hansen
  2021-03-31 22:28                                                     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-03-31 22:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

On 3/31/21 3:06 PM, Sean Christopherson wrote:
> I've no objection to a nice message in the #VE handler.  What I'm objecting to
> is sanity checking the CPUID model provided by the TDX module.  If we don't
> trust the TDX module to honor the spec, then there are a huge pile of things
> that are far higher priority than MONITOR/MWAIT.

In other words:  Don't muck with CPUID or the X86_FEATURE at all.  Don't
check it to comply with the spec.  If something doesn't comply, we'll
get a #VE at *SOME* point.  We don't need to do belt-and-suspenders
programming here.

That sounds sane to me.


* Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-31 22:11                                                   ` Dave Hansen
@ 2021-03-31 22:28                                                     ` Kuppuswamy, Sathyanarayanan
  2021-03-31 22:32                                                       ` Sean Christopherson
  2021-03-31 22:34                                                       ` Dave Hansen
  0 siblings, 2 replies; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-03-31 22:28 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson
  Cc: Peter Zijlstra, Andy Lutomirski, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel



On 3/31/21 3:11 PM, Dave Hansen wrote:
> On 3/31/21 3:06 PM, Sean Christopherson wrote:
>> I've no objection to a nice message in the #VE handler.  What I'm objecting to
>> is sanity checking the CPUID model provided by the TDX module.  If we don't
>> trust the TDX module to honor the spec, then there are a huge pile of things
>> that are far higher priority than MONITOR/MWAIT.
> 
> In other words:  Don't muck with CPUID or the X86_FEATURE at all.  Don't
> check it to comply with the spec.  If something doesn't comply, we'll
> get a #VE at *SOME* point.  We don't need to do belt-and-suspenders
> programming here.
> 
> That sounds sane to me.
But I think there are cases (like MCE) where SEAM does not disable them because
there will be future support for it. We should at least suppress such features
in the kernel.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-31 21:49                                             ` Dave Hansen
@ 2021-03-31 22:29                                               ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-03-31 22:29 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel



On 3/31/21 2:49 PM, Dave Hansen wrote:
> On 3/31/21 2:09 PM, Kuppuswamy Sathyanarayanan wrote:
>> As per Guest-Host Communication Interface (GHCI) Specification
>> for Intel TDX, sec 2.4.1, TDX architecture does not support
>> MWAIT, MONITOR and WBINVD instructions. So in non-root TDX mode,
>> if MWAIT/MONITOR instructions are executed with CPL != 0 it will
>> trigger #UD, and for CPL = 0 case, virtual exception (#VE) is
>> triggered. WBINVD instruction behavior is also similar to
>> MWAIT/MONITOR, but for CPL != 0 case, it will trigger #GP instead
>> of #UD.
> 
> Could we give it a go in plain English before jumping in and
> quoting the exact spec section?  Also, the CPL language is nice and
> precise for talking inside Intel, but it's generally easier for me to
> read kernel descriptions when we just talk about the kernel.
> 
> 	When running as a TDX guest, there are a number of existing,
> 	privileged instructions that do not work.  If the guest kernel
> 	uses these instructions, the hardware generates a #VE.
I will fix it in the next version.
> 
> Which reminds me...  The SDM says: MWAIT will "#UD ... If
> CPUID.01H:ECX.MONITOR[bit 3] = 0".  So, is this an architectural change?
>   The guest is *supposed* to see that CPUID bit as 0, so shouldn't it
> also get a #UD?  Or is this all so that if SEAM *forgets* to clear the
> CPUID bit, the guest gets #VE?
AFAIK, we are only concerned about the case where the instruction support
is not disabled by SEAM. For the disabled case, it should get a #UD.
Sean, can you confirm?
> 
> What are we *actually* mitigating here?
We add support for handling the #VE raised when the TD guest kernel
executes an unsupported instruction.
> 
> Also, FWIW, MWAIT/MONITOR and WBINVD are pretty different beasts.  I
> think this would all have been a lot more clear if this would have been
> two patches instead of shoehorning them into one.
Since all of them are unsupported instructions, I have grouped them
together. Even if we split it, there would be some duplication in the
commit logs (since the handling is similar). But let me know if that is
the desired approach; I can split it into two patches.
> 
>> To prevent TD guest from using these unsupported instructions,
>> following measures are adapted:
>>
>> 1. For MWAIT/MONITOR instructions, support for these instructions
>> are already disabled by TDX module (SEAM). So CPUID flags for
>> these instructions should be in disabled state. Also, just to be
>> sure that these instructions are disabled, forcefully unset
>> X86_FEATURE_MWAIT CPU cap in OS.
>>
>> 2. For WBINVD instruction, we use audit to find the code that uses
>> this instruction and disable them for TD.
> 
> Really?  Where are those patches?
For MWAIT/MONITOR, the change is included in the same patch.
For WBINVD, we will have some patches included in the next
series.
> 
>> +static inline bool cpuid_has_mwait(void)
>> +{
>> +	if (cpuid_ecx(1) & (1 << (X86_FEATURE_MWAIT % 32)))
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>>   bool is_tdx_guest(void)
>>   {
>>   	return static_cpu_has(X86_FEATURE_TDX_GUEST);
>> @@ -301,12 +309,25 @@ static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
>>   	return insn.length;
>>   }
>>   
>> +/* Initialize TDX specific CPU capabilities */
>> +static void __init tdx_cpu_cap_init(void)
>> +{
>> +	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>> +
>> +	if (cpuid_has_mwait()) {
>> +		WARN(1, "TDX Module failed to disable MWAIT\n");
> 
> WARN(1, "TDX guest enumerated support for MWAIT, disabling it").
Will fix it in the next version.
> 
>> +		/* MWAIT is not supported in TDX platform, so suppress it */
>> +		setup_clear_cpu_cap(X86_FEATURE_MWAIT);
>> +	}
>> +
>> +}
> 
> Extra newline.
> 
>>   void __init tdx_early_init(void)
>>   {
>>   	if (!cpuid_has_tdx_guest())
>>   		return;
>>   
>> -	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>> +	tdx_cpu_cap_init();
>>   
>>   	tdg_get_info();
>>   
>> @@ -362,6 +383,27 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>>   	case EXIT_REASON_EPT_VIOLATION:
>>   		ve->instr_len = tdg_handle_mmio(regs, ve);
>>   		break;
>> +	case EXIT_REASON_WBINVD:
>> +		/*
>> +		 * TDX architecture does not support WBINVD instruction.
>> +		 * Currently, usage of this instruction is prevented by
>> +		 * disabling the drivers which uses it. So if we still
>> +		 * reach here, it needs user attention.
>> +		 */
> 
> This comment is awfully vague.  "TDX architecture..." what?  Any CPUs
> supporting the TDX architecture?  TDX VMM's?  TDX Guests?
> 
> Let's also not waste bytes on stating the obvious.  If it didn't need
> attention we wouldn't be warning about it, eh?
> 
> So, let's halve the size of the comment and say:
> 
> 		/*
> 		 * WBINVD is not supported inside TDX guests.  All in-
> 		 * kernel uses should have been disabled.
> 		 */
OK, will fix it in the next version.
> 
>> +		pr_err("TD Guest used unsupported WBINVD instruction\n");
>> +		BUG();
>> +		break;
>> +	case EXIT_REASON_MONITOR_INSTRUCTION:
>> +	case EXIT_REASON_MWAIT_INSTRUCTION:
>> +		/*
>> +		 * MWAIT/MONITOR features are disabled by TDX Module (SEAM)
>> +		 * and also re-suppressed in kernel by clearing
>> +		 * X86_FEATURE_MWAIT CPU feature flag in tdx_early_init(). So
>> +		 * if TD guest still executes MWAIT/MONITOR instruction with
>> +		 * above suppression, it needs user attention.
>> +		 */
> 
> Again, let's trim this down:
> 
> 		/*
> 		 * Something in the kernel used MONITOR or MWAIT despite
> 		 * X86_FEATURE_MWAIT being cleared for TDX guests.
> 		 */
Will fix it in the next version.
> 
> Rather than naming the function, this makes it quite greppable to find
> where it could have *possibly* been cleared.
> 
>> +		WARN(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
I think WARN_ONCE is good enough for this exception. Do you agree?
>> +		break;
>>   	default:
>>   		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
>>   		return -EFAULT;
>>
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-31 22:28                                                     ` Kuppuswamy, Sathyanarayanan
@ 2021-03-31 22:32                                                       ` Sean Christopherson
  2021-03-31 22:34                                                       ` Dave Hansen
  1 sibling, 0 replies; 161+ messages in thread
From: Sean Christopherson @ 2021-03-31 22:32 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, linux-kernel

On Wed, Mar 31, 2021, Kuppuswamy, Sathyanarayanan wrote:
> 
> On 3/31/21 3:11 PM, Dave Hansen wrote:
> > On 3/31/21 3:06 PM, Sean Christopherson wrote:
> > > I've no objection to a nice message in the #VE handler.  What I'm objecting to
> > > is sanity checking the CPUID model provided by the TDX module.  If we don't
> > > trust the TDX module to honor the spec, then there are a huge pile of things
> > > that are far higher priority than MONITOR/MWAIT.
> > 
> > In other words:  Don't muck with CPUID or the X86_FEATURE at all.  Don't
> > check it to comply with the spec.  If something doesn't comply, we'll
> > get a #VE at *SOME* point.  We don't need to do belt-and-suspenders
> > programming here.
> > 
> > That sounds sane to me.
> But I think there are cases (like MCE) where SEAM does not disable them because
> there will be future support for it. We should at least suppress such features
> in the kernel.

MCE is a terrible example, because the TDX behavior for MCE is terrible.
Enumerating MCE as supported but injecting a #GP if the guest attempts to set
CR4.MCE=1 is awful.  I'm all for treating that as a one-off case, with a very
derogatory comment :-)


* Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-31 22:28                                                     ` Kuppuswamy, Sathyanarayanan
  2021-03-31 22:32                                                       ` Sean Christopherson
@ 2021-03-31 22:34                                                       ` Dave Hansen
  2021-04-01  3:28                                                         ` Andi Kleen
  1 sibling, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-03-31 22:34 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Sean Christopherson
  Cc: Peter Zijlstra, Andy Lutomirski, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On 3/31/21 3:28 PM, Kuppuswamy, Sathyanarayanan wrote:
> 
> On 3/31/21 3:11 PM, Dave Hansen wrote:
>> On 3/31/21 3:06 PM, Sean Christopherson wrote:
>>> I've no objection to a nice message in the #VE handler.  What I'm
>>> objecting to
>>> is sanity checking the CPUID model provided by the TDX module.  If we
>>> don't
>>> trust the TDX module to honor the spec, then there are a huge pile of
>>> things
>>> that are far higher priority than MONITOR/MWAIT.
>>
>> In other words:  Don't muck with CPUID or the X86_FEATURE at all.  Don't
>> check it to comply with the spec.  If something doesn't comply, we'll
>> get a #VE at *SOME* point.  We don't need to do belt-and-suspenders
>> programming here.
>>
>> That sounds sane to me.
> But I think there are cases (like MCE) where SEAM does not disable
> them because there will be future support for it. We should at least
> suppress such features in the kernel.

Specifics, please.

The hardware (and VMMs and SEAM) have ways of telling the guest kernel
what is supported: CPUID.  If it screws up, and the guest gets an
unexpected #VE, so be it.

We don't have all kinds of crazy handling in the kernel's #UD handler
just in case a CPU mis-enumerates a feature and we get a #UD.  We have
to trust the underlying hardware to be sane.  If it isn't, we die a
horrible death as fast as possible.  Why should TDX be any different?


* Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-31 22:34                                                       ` Dave Hansen
@ 2021-04-01  3:28                                                         ` Andi Kleen
  2021-04-01  3:46                                                           ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Andi Kleen @ 2021-04-01  3:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy, Sathyanarayanan, Sean Christopherson, Peter Zijlstra,
	Andy Lutomirski, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

> The hardware (and VMMs and SEAM) have ways of telling the guest kernel
> what is supported: CPUID.  If it screws up, and the guest gets an
> unexpected #VE, so be it.

The main reason for disabling stuff is actually that we don't need
to harden it. All these things are potential attack paths.

> 
> We don't have all kinds of crazy handling in the kernel's #UD handler
> just in case a CPU mis-enumerates a feature and we get a #UD.  We have
> to trust the underlying hardware to be sane.  If it isn't, we die a
> horrible death as fast as possible.  Why should TDX be any different?

That's what the original patch did -- no unnecessary checks -- but reviewers
keep asking for the extra checks, so Sathya added more. We have the not
unusual problem here that reviewers don't agree among themselves.

-Andi


* Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-04-01  3:28                                                         ` Andi Kleen
@ 2021-04-01  3:46                                                           ` Dave Hansen
  2021-04-01  4:24                                                             ` Andi Kleen
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-01  3:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy, Sathyanarayanan, Sean Christopherson, Peter Zijlstra,
	Andy Lutomirski, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

On 3/31/21 8:28 PM, Andi Kleen wrote:
>> The hardware (and VMMs and SEAM) have ways of telling the guest kernel
>> what is supported: CPUID.  If it screws up, and the guest gets an
>> unexpected #VE, so be it.
> The main reason for disabling stuff is actually that we don't need
> to harden it. All these things are potential attack paths.

Wait, MWAIT is an attack path?  If it were an attack path, wouldn't it
be an attack path that was created from the SEAM layer or the hardware
being broken?  Aren't those two things within the trust boundary?  Do we
harden against other things within the trust boundary?

>> We don't have all kinds of crazy handling in the kernel's #UD handler
>> just in case a CPU mis-enumerates a feature and we get a #UD.  We have
>> to trust the underlying hardware to be sane.  If it isn't, we die a
>> horrible death as fast as possible.  Why should TDX be any different?
> That's what the original patch did -- no unnecessary checks -- but reviewers
> keep asking for the extra checks, so Sathya added more. We have the not
> unusual problem here that reviewers don't agree among themselves.

Getting consensus is a pain in the neck, eh?

It's too bad all the reviewers in the community aren't like all of the
engineers at big companies where everyone always agrees. :)


* Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-04-01  3:46                                                           ` Dave Hansen
@ 2021-04-01  4:24                                                             ` Andi Kleen
  2021-04-01  4:51                                                               ` [PATCH v5 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Andi Kleen @ 2021-04-01  4:24 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy, Sathyanarayanan, Sean Christopherson, Peter Zijlstra,
	Andy Lutomirski, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

On Wed, Mar 31, 2021 at 08:46:18PM -0700, Dave Hansen wrote:
> On 3/31/21 8:28 PM, Andi Kleen wrote:
> >> The hardware (and VMMs and SEAM) have ways of telling the guest kernel
> >> what is supported: CPUID.  If it screws up, and the guest gets an
> >> unexpected #VE, so be it.
> > The main reason for disabling stuff is actually that we don't need
> > to harden it. All these things are potential attack paths.
> 
> Wait, MWAIT is an attack path?  If it were an attack path, wouldn't it

No MWAIT is not, but lots of other things that can be controlled by the
host are. And that will be a motivation to disable things.

> >> We don't have all kinds of crazy handling in the kernel's #UD handler
> >> just in case a CPU mis-enumerates a feature and we get a #UD.  We have
> >> to trust the underlying hardware to be sane.  If it isn't, we die a
> >> horrible death as fast as possible.  Why should TDX be any different?
> > That's what the original patch did -- no unnecessary checks -- but reviewers
> > keep asking for the extra checks, so Sathya added more. We have the not
> > unusual problem here that reviewers don't agree among themselves.
> 
> Getting consensus is a pain in the neck, eh?

It seems more like a circular argument currently.
> 
> It's too bad all the reviewers in the community aren't like all of the
> engineers at big companies where everyone always agrees. :)

I would propose to go back to the original patch without all the extra
checks. I think that's what you're arguing for too. IIRC the person
who originally requested the extra checks was Andy; if he's OK with
that too, we can do it, so that you guys can finally move on
to the other patches that do more than just trivial things.

-Andi


* [PATCH v5 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-04-01  4:24                                                             ` Andi Kleen
@ 2021-04-01  4:51                                                               ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-01  4:51 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

When running as a TDX guest, there are a number of existing,
privileged instructions that do not work. If the guest kernel
uses these instructions, the hardware generates a #VE.

You can find the list of unsupported instructions in Intel
Trust Domain Extensions (Intel® TDX) Module specification,
sec 9.2.2 and in Guest-Host Communication Interface (GHCI)
Specification for Intel TDX, sec 2.4.1.
   
To prevent the TD guest from using these unsupported instructions,
the following measures are adopted:

1. For MWAIT/MONITOR instructions, support is already disabled by
the TDX module (SEAM), so the CPUID flags for these instructions
should be in the disabled state. Also, just to be sure that these
instructions are disabled, forcefully unset the X86_FEATURE_MWAIT
CPU cap in the OS.

2. For the WBINVD instruction, we audit the code that uses this
instruction and disable those users for TD.

If TD guests still execute these instructions despite the above
preventive measures, print appropriate warning messages in the #VE
handler.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---

Changes since v4:
 * Fixed commit log and comments as per Dave's comments
 * Used WARN_ONCE for MWAIT/MONITOR #VE.
 * Removed X86_FEATURE_MWAIT suppression code.

Changes since v3:
 * WARN user if SEAM does not disable MONITOR/MWAIT instruction.
 * Fix the commit log and comments to address review comments from
   from Dave & Sean.

Changes since v2:
 * Added BUG() for WBINVD, WARN for MONITOR instructions.
 * Fixed comments as per Dave's review.

Changes since v1:
 * Added WARN() for MWAIT #VE exception.

Changes since previous series:
 * Suppressed MWAIT feature as per Andi's comment.
 * Added warning debug log for MWAIT #VE exception.

 arch/x86/kernel/tdx.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e936b2f88bf6..9bc84caf4096 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -362,6 +362,22 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdg_handle_mmio(regs, ve);
 		break;
+	case EXIT_REASON_WBINVD:
+		/*
+		 * WBINVD is not supported inside TDX guests. All in-
+		 * kernel uses should have been disabled.
+		 */
+		pr_err("TD Guest used unsupported WBINVD instruction\n");
+		BUG();
+		break;
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+		/*
+		 * Something in the kernel used MONITOR or MWAIT despite
+		 * X86_FEATURE_MWAIT being cleared for TDX guests.
+		 */
+		WARN_ONCE(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
+		break;
 	default:
 		pr_warn("Unexpected #VE: %d\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* Re: [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO
  2021-02-05 23:38 ` [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
@ 2021-04-01 19:56   ` Dave Hansen
  2021-04-01 22:26     ` Sean Christopherson
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-01 19:56 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
> exit reason.
> 
> For now we only handle subset of instruction that kernel uses for MMIO
> oerations. User-space access triggers SIGBUS.
..
> +	case EXIT_REASON_EPT_VIOLATION:
> +		ve->instr_len = tdx_handle_mmio(regs, ve);
> +		break;

Is MMIO literally the only thing that can cause an EPT violation for TDX
guests?

Forget userspace for a minute.  #VE's from userspace are annoying, but
fine.  We can't control what userspace does.  If an action it takes
causes a #VE in the TDX architecture, tough cookies, the kernel must
handle it and try to recover or kill the app.

The kernel is very different.  We know in advance (must know,
actually...) which instructions might cause exceptions of any kind.
That's why we have exception tables and copy_to/from_user().  That's why
we can handle kernel page faults on userspace, but not inside spinlocks.

Binary-dependent OSes are also very different.  It's going to be natural
for them to want to take existing, signed drivers and use them in TDX
guests.  They might want to do something like this.

But for an OS where we have source for the *ENTIRE* thing, and where we
have a chokepoint for MMIO accesses (arch/x86/include/asm/io.h), it
seems like an *AWFUL* idea to:
1. Have the kernel set up special mappings for I/O memory
2. Kernel generates special instructions to access that memory
3. Kernel faults on that memory
4. Kernel cracks its own special instructions to see what they were
   doing
5. Kernel calls up to host to do the MMIO

Instead of doing 2/3/4, why not just have #2 call up to the host
directly?  This patch seems a very slow, roundabout way to do
paravirtualized MMIO.

BTW, there's already some SEV special-casing in io.h.


* Re: [RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code
  2021-02-05 23:38 ` [RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code Kuppuswamy Sathyanarayanan
@ 2021-04-01 20:06   ` Dave Hansen
  2021-04-06 15:37     ` Kirill A. Shutemov
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-01 20:06 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Intel TDX doesn't allow VMM to access guest memory. Any memory that is
> required for communication with VMM suppose to be shared explicitly by

s/suppose to/must/

> setting the bit in page table entry. The shared memory is similar to
> unencrypted memory in AMD SME/SEV terminology.

In addition to setting the page table bit, there's also a dance to go
through to convert the memory.  Please mention the procedure here at
least.  It's very different from SME.

> force_dma_unencrypted() has to return true for TDX guest. Move it out of
> AMD SME code.

You lost me here.  What does force_dma_unencrypted() have to do with
host/guest shared memory?

> Introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
> selected by all x86 memory encryption features.

Please also mention what will set it.  I assume TDX guest support will
set this option.  It's probably also worth a sentence to say that
force_dma_unencrypted() will have TDX-specific code added to it.  (It
will, right??)

> This is preparation for TDX changes in DMA code.

Probably best to also mention that this effectively just moves code
around.  This patch should have no functional changes at runtime.


> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0374d9f262a5..8fa654d61ac2 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1538,14 +1538,18 @@ config X86_CPA_STATISTICS
>  	  helps to determine the effectiveness of preserving large and huge
>  	  page mappings when mapping protections are changed.
>  
> +config X86_MEM_ENCRYPT_COMMON
> +	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
> +	select DYNAMIC_PHYSICAL_MASK
> +	def_bool n
> +
>  config AMD_MEM_ENCRYPT
>  	bool "AMD Secure Memory Encryption (SME) support"
>  	depends on X86_64 && CPU_SUP_AMD
>  	select DMA_COHERENT_POOL
> -	select DYNAMIC_PHYSICAL_MASK
>  	select ARCH_USE_MEMREMAP_PROT
> -	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
>  	select INSTRUCTION_DECODER
> +	select X86_MEM_ENCRYPT_COMMON
>  	help
>  	  Say yes to enable support for the encryption of system memory.
>  	  This requires an AMD processor that supports Secure Memory
> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> index 30a3b30395ad..95e534cffa99 100644
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -257,10 +257,12 @@ static inline void slow_down_io(void)
>  
>  #endif
>  
> -#ifdef CONFIG_AMD_MEM_ENCRYPT
>  #include <linux/jump_label.h>
>  
>  extern struct static_key_false sev_enable_key;

This _looks_ odd.  sev_enable_key went from being under
CONFIG_AMD_MEM_ENCRYPT to being unconditionally referenced.

Could you explain a bit more?

I would have expected it to at *least* be tied to the new #ifdef.

> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +
>  static inline bool sev_key_active(void)
>  {
>  	return static_branch_unlikely(&sev_enable_key);
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index 5864219221ca..b31cb52bf1bd 100644
> --- a/arch/x86/mm/Makefile
...


* Re: [RFC v1 22/26] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-02-05 23:38 ` [RFC v1 22/26] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Kuppuswamy Sathyanarayanan
@ 2021-04-01 20:13   ` Dave Hansen
  2021-04-06 15:54     ` Kirill A. Shutemov
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-01 20:13 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> tdx_shared_mask() returns the mask that has to be set in page table
> entry to make page shared with VMM.

Needs to be either:

	has to be set in a page table entry
or
	has to be set in page table entries

Pick one, please.  But, the grammar is wrong as-is.


> index 8fa654d61ac2..f10a00c4ad7f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -875,6 +875,7 @@ config INTEL_TDX_GUEST
>  	select PARAVIRT_XL
>  	select X86_X2APIC
>  	select SECURITY_LOCKDOWN_LSM
> +	select X86_MEM_ENCRYPT_COMMON
>  	help
>  	  Provide support for running in a trusted domain on Intel processors
>  	  equipped with Trusted Domain eXtenstions. TDX is an new Intel
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index b46ae140e39b..9bbfe6520ea4 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -104,5 +104,6 @@ long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
>  long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
>  		unsigned long p3, unsigned long p4);
>  
> +phys_addr_t tdx_shared_mask(void);

I know it's redundant, but extern this, please.  Ditto for all the other
declarations in that header.

>  #endif
>  #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index ae37498df981..9681f4a0b4e0 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -41,6 +41,11 @@ bool is_tdx_guest(void)
>  }
>  EXPORT_SYMBOL_GPL(is_tdx_guest);
>  
> +phys_addr_t tdx_shared_mask(void)
> +{
> +	return 1ULL << (td_info.gpa_width - 1);
> +}

A comment would be helpful:

/* The highest bit of a guest physical address is the "sharing" bit */

>  static void tdx_get_info(void)
>  {
>  	register long rcx asm("rcx");
> @@ -56,6 +61,9 @@ static void tdx_get_info(void)
>  
>  	td_info.gpa_width = rcx & GENMASK(5, 0);
>  	td_info.attributes = rdx;
> +
> +	/* Exclude Shared bit from the __PHYSICAL_MASK */
> +	physical_mask &= ~tdx_shared_mask();
>  }

I wish we had all of these 'physical_mask' manipulations in a single
spot.  Can we consolidate these instead of having TDX and SME poke at
them individually?


* Re: [RFC v1 23/26] x86/tdx: Make pages shared in ioremap()
  2021-02-05 23:38 ` [RFC v1 23/26] x86/tdx: Make pages shared in ioremap() Kuppuswamy Sathyanarayanan
@ 2021-04-01 20:26   ` Dave Hansen
  2021-04-06 16:00     ` Kirill A. Shutemov
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-01 20:26 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> All ioremap()ed paged that are not backed by normal memory (NONE or
> RESERVED) have to be mapped as shared.

s/paged/pages/


> +/* Make the page accesable by VMM */
> +#define pgprot_tdx_shared(prot) __pgprot(pgprot_val(prot) | tdx_shared_mask())
> +
>  #ifndef __ASSEMBLY__
>  #include <asm/x86_init.h>
>  #include <asm/fpu/xstate.h>
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index 9e5ccc56f8e0..a0ba760866d4 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -87,12 +87,12 @@ static unsigned int __ioremap_check_ram(struct resource *res)
>  }
>  
>  /*
> - * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
> - * there the whole memory is already encrypted.
> + * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
> + * private in TDX case) because there the whole memory is already encrypted.
>   */

But doesn't this mean that we can't ioremap() normal memory?  I was
somehow expecting that we would need to do this for some host<->guest
communication pages.

>  static unsigned int __ioremap_check_encrypted(struct resource *res)
>  {
> -	if (!sev_active())
> +	if (!sev_active() && !is_tdx_guest())
>  		return 0;
>  
>  	switch (res->desc) {
> @@ -244,6 +244,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
>  	prot = PAGE_KERNEL_IO;
>  	if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
>  		prot = pgprot_encrypted(prot);
> +	else if (is_tdx_guest())
> +		prot = pgprot_tdx_shared(prot);



* Re: [RFC v1 25/26] x86/tdx: Make DMA pages shared
  2021-02-05 23:38 ` [RFC v1 25/26] x86/tdx: Make DMA pages shared Kuppuswamy Sathyanarayanan
@ 2021-04-01 21:01   ` Dave Hansen
  2021-04-06 16:31     ` Kirill A. Shutemov
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-01 21:01 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kai Huang, Sean Christopherson

> +int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private)
> +{
> +	int ret, i;
> +
> +	ret = __tdx_map_gpa(gpa, numpages, private);
> +	if (ret || !private)
> +		return ret;
> +
> +	for (i = 0; i < numpages; i++)
> +		tdx_accept_page(gpa + i*PAGE_SIZE);
> +
> +	return 0;
> +}

Please do something like this:

enum tdx_map_type {
	TDX_MAP_PRIVATE,
	TDX_MAP_SHARED
};

Then, your calls will look like:

	tdx_map_gpa(gpa, nr, TDX_MAP_SHARED);

instead of:

	tdx_map_gpa(gpa, nr, false);

>  static __cpuidle void tdx_halt(void)
>  {
>  	register long r10 asm("r10") = TDVMCALL_STANDARD;
> diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
> index 964e04152417..b6d93b0c5dcf 100644
> --- a/arch/x86/mm/mem_encrypt_common.c
> +++ b/arch/x86/mm/mem_encrypt_common.c
> @@ -15,9 +15,9 @@
>  bool force_dma_unencrypted(struct device *dev)
>  {
>  	/*
> -	 * For SEV, all DMA must be to unencrypted/shared addresses.
> +	 * For SEV and TDX, all DMA must be to unencrypted/shared addresses.
>  	 */
> -	if (sev_active())
> +	if (sev_active() || is_tdx_guest())
>  		return true;
>  
>  	/*
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 16f878c26667..6f23a9816ef0 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -27,6 +27,7 @@
>  #include <asm/proto.h>
>  #include <asm/memtype.h>
>  #include <asm/set_memory.h>
> +#include <asm/tdx.h>
>  
>  #include "../mm_internal.h"
>  
> @@ -1977,8 +1978,8 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
>  	struct cpa_data cpa;
>  	int ret;
>  
> -	/* Nothing to do if memory encryption is not active */
> -	if (!mem_encrypt_active())
> +	/* Nothing to do if memory encryption and TDX are not active */
> +	if (!mem_encrypt_active() && !is_tdx_guest())
>  		return 0;

So, this is starting to look like the "enc" naming is wrong, or at least
a little misleading.   Should we be talking about "protection" or
"guards" or something?

>  	/* Should not be working on unaligned addresses */
> @@ -1988,8 +1989,14 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
>  	memset(&cpa, 0, sizeof(cpa));
>  	cpa.vaddr = &addr;
>  	cpa.numpages = numpages;
> -	cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
> -	cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
> +	if (is_tdx_guest()) {
> +		cpa.mask_set = __pgprot(enc ? 0 : tdx_shared_mask());
> +		cpa.mask_clr = __pgprot(enc ? tdx_shared_mask() : 0);
> +	} else {
> +		cpa.mask_set = __pgprot(enc ? _PAGE_ENC : 0);
> +		cpa.mask_clr = __pgprot(enc ? 0 : _PAGE_ENC);
> +	}

OK, this is too hideous to live.  It sucks that the TDX and SEV/SME bits
are opposite polarity, but oh well.

To me, this gets a lot clearer, and opens up room for commenting if you
do something like:

	if (is_tdx_guest()) {
		mem_enc_bits   = 0;
		mem_plain_bits = tdx_shared_mask();
	} else {
		mem_enc_bits   = _PAGE_ENC;
		mem_plain_bits = 0
	}

	if (enc) {
		cpa.mask_set = mem_enc_bits;
		cpa.mask_clr = mem_plain_bits;  // clear "plain" bits
	} else {
		
		cpa.mask_set = mem_plain_bits;
		cpa.mask_clr = mem_enc_bits;	// clear encryption bits
	}

>  	cpa.pgd = init_mm.pgd;
>  
>  	/* Must avoid aliasing mappings in the highmem code */
> @@ -1999,7 +2006,8 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
>  	/*
>  	 * Before changing the encryption attribute, we need to flush caches.
>  	 */
> -	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
> +	if (!enc || !is_tdx_guest())
> +		cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));

That "!enc" looks wrong to me.  Caches would need to be flushed whenever
encryption attributes *change*, not just when they are set.

Also, cpa_flush() flushes caches *AND* the TLB.  How does TDX manage to
not need TLB flushes?

>  	ret = __change_page_attr_set_clr(&cpa, 1);
>  
> @@ -2012,6 +2020,11 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
>  	 */
>  	cpa_flush(&cpa, 0);
>  
> +	if (!ret && is_tdx_guest()) {
> +		ret = tdx_map_gpa(__pa(addr), numpages, enc);
> +		// XXX: need to undo on error?
> +	}

Time to fix this stuff up if you want folks to take this series more
seriously.


* Re: [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface
  2021-02-05 23:38 ` [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface Kuppuswamy Sathyanarayanan
@ 2021-04-01 21:08   ` Dave Hansen
  2021-04-01 21:15     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-01 21:08 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> +bool is_tdx_guest(void)
> +{
> +	return static_cpu_has(X86_FEATURE_TDX_GUEST);
> +}

Why do you need is_tdx_guest() as opposed to calling
cpu_feature_enabled(X86_FEATURE_TDX_GUEST) everywhere?


* Re: [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface
  2021-04-01 21:08   ` Dave Hansen
@ 2021-04-01 21:15     ` Kuppuswamy, Sathyanarayanan
  2021-04-01 21:19       ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-01 21:15 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson



On 4/1/21 2:08 PM, Dave Hansen wrote:
> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
>> +bool is_tdx_guest(void)
>> +{
>> +	return static_cpu_has(X86_FEATURE_TDX_GUEST);
>> +}
> 
> Why do you need is_tdx_guest() as opposed to calling
> cpu_feature_enabled(X86_FEATURE_TDX_GUEST) everywhere?

is_tdx_guest() is also implemented/used in compressed
code (which uses native_cpuid calls). I don't think
we can use cpu_feature_enabled(X86_FEATURE_TDX_GUEST) in
compressed code right? Also is_tdx_guest() looks easy
to read and use.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v1 26/26] x86/kvm: Use bounce buffers for TD guest
  2021-02-05 23:38 ` [RFC v1 26/26] x86/kvm: Use bounce buffers for TD guest Kuppuswamy Sathyanarayanan
@ 2021-04-01 21:17   ` Dave Hansen
  0 siblings, 0 replies; 161+ messages in thread
From: Dave Hansen @ 2021-04-01 21:17 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> TDX doesn't allow to perform DMA access to guest private memory.
> In order for DMA to work properly in TD guest, user SWIOTLB bounce
> buffers.
> 
> Move AMD SEV initialization into common code and adopt for TDX.

This would be best if it can draw a parallel between TDX and SEV.

>  arch/x86/kernel/pci-swiotlb.c    |  2 +-
>  arch/x86/kernel/tdx.c            |  3 +++
>  arch/x86/mm/mem_encrypt.c        | 44 -------------------------------
>  arch/x86/mm/mem_encrypt_common.c | 45 ++++++++++++++++++++++++++++++++
>  4 files changed, 49 insertions(+), 45 deletions(-)
> 
> diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
> index c2cfa5e7c152..020e13749758 100644
> --- a/arch/x86/kernel/pci-swiotlb.c
> +++ b/arch/x86/kernel/pci-swiotlb.c
> @@ -49,7 +49,7 @@ int __init pci_swiotlb_detect_4gb(void)
>  	 * buffers are allocated and used for devices that do not support
>  	 * the addressing range required for the encryption mask.
>  	 */
> -	if (sme_active())
> +	if (sme_active() || is_tdx_guest())
>  		swiotlb = 1;
>  
>  	return swiotlb;
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index f51a19168adc..ccb9401bd706 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -9,6 +9,7 @@
>  #include <asm/vmx.h>
>  #include <asm/insn.h>
>  #include <linux/sched/signal.h> /* force_sig_fault() */
> +#include <linux/swiotlb.h>
>  
>  #ifdef CONFIG_KVM_GUEST
>  #include "tdx-kvm.c"
> @@ -472,6 +473,8 @@ void __init tdx_early_init(void)
>  
>  	legacy_pic = &null_legacy_pic;
>  
> +	swiotlb_force = SWIOTLB_FORCE;

Dumb question time.  But, what is the difference between

	swiotlb = 1;

and

	swiotlb_force = SWIOTLB_FORCE;

It would be nice if the patch enabled me to be a lazy reviewer.

>  	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdx:cpu_hotplug",
>  			  NULL, tdx_cpu_offline_prepare);
>  
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 11a6a7b3af7e..7fbbb2f3d426 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c

Should we be renaming this to amd_mem_encrypt.c or something?

...
> -	 */
> -	if (sev_active())
> -		static_branch_enable(&sev_enable_key);
> -
> -	print_mem_encrypt_feature_info();
> -}
> -
> diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
> index b6d93b0c5dcf..6f3d90d4d68e 100644
> --- a/arch/x86/mm/mem_encrypt_common.c
> +++ b/arch/x86/mm/mem_encrypt_common.c
> @@ -10,6 +10,7 @@
>  #include <linux/mm.h>
>  #include <linux/mem_encrypt.h>
>  #include <linux/dma-mapping.h>
> +#include <linux/swiotlb.h>
>  
>  /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
>  bool force_dma_unencrypted(struct device *dev)
> @@ -36,3 +37,47 @@ bool force_dma_unencrypted(struct device *dev)
>  
>  	return false;
>  }
> +
> +static void print_mem_encrypt_feature_info(void)
> +{

This function is now named wrong IMNHO.  If it's about AMD only, it
needs AMD in the name.

> +	pr_info("AMD Memory Encryption Features active:");
> +
> +	/* Secure Memory Encryption */
> +	if (sme_active()) {
> +		/*
> +		 * SME is mutually exclusive with any of the SEV
> +		 * features below.
> +		 */
> +		pr_cont(" SME\n");
> +		return;
> +	}
> +
> +	/* Secure Encrypted Virtualization */
> +	if (sev_active())
> +		pr_cont(" SEV");
> +
> +	/* Encrypted Register State */
> +	if (sev_es_active())
> +		pr_cont(" SEV-ES");
> +
> +	pr_cont("\n");
> +}

I'm really tempted to say this needs to be off in arch/x86/kernel/cpu/amd.c

> +/* Architecture __weak replacement functions */
> +void __init mem_encrypt_init(void)
> +{
> +	if (!sme_me_mask && !is_tdx_guest())
> +		return;

The direct check of sme_me_mask looks odd now.  What does this *MEAN*?
Are we looking to jump out of here if no memory encryption is enabled?

I'd much rather this look more like:

	if (!x86_memory_encryption())
		return;

> +	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
> +	swiotlb_update_mem_attributes();
> +	/*
> +	 * With SEV, we need to unroll the rep string I/O instructions.
> +	 */
> +	if (sev_active())
> +		static_branch_enable(&sev_enable_key);
> +
> +	if (!is_tdx_guest())
> +		print_mem_encrypt_feature_info();
> +}



* Re: [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface
  2021-04-01 21:15     ` Kuppuswamy, Sathyanarayanan
@ 2021-04-01 21:19       ` Dave Hansen
  2021-04-01 22:25         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-01 21:19 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson

On 4/1/21 2:15 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 4/1/21 2:08 PM, Dave Hansen wrote:
>> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
>>> +bool is_tdx_guest(void)
>>> +{
>>> +    return static_cpu_has(X86_FEATURE_TDX_GUEST);
>>> +}
>>
>> Why do you need is_tdx_guest() as opposed to calling
>> cpu_feature_enabled(X86_FEATURE_TDX_GUEST) everywhere?
> 
> is_tdx_guest() is also implemented/used in compressed
> code (which uses native_cpuid calls). I don't think
> we can use cpu_feature_enabled(X86_FEATURE_TDX_GUEST) in
> compressed code right? Also is_tdx_guest() looks easy
> to read and use.

OK, but how many of the is_tdx_guest() uses are in the compressed code?
 Why has its use spread beyond that?


* Re: [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface
  2021-04-01 21:19       ` Dave Hansen
@ 2021-04-01 22:25         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-01 22:25 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson



On 4/1/21 2:19 PM, Dave Hansen wrote:
> On 4/1/21 2:15 PM, Kuppuswamy, Sathyanarayanan wrote:
>> On 4/1/21 2:08 PM, Dave Hansen wrote:
>>> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
>>>> +bool is_tdx_guest(void)
>>>> +{
>>>> +    return static_cpu_has(X86_FEATURE_TDX_GUEST);
>>>> +}
>>>
>>> Why do you need is_tdx_guest() as opposed to calling
>>> cpu_feature_enabled(X86_FEATURE_TDX_GUEST) everywhere?
>>
>> is_tdx_guest() is also implemented/used in compressed
>> code (which uses native_cpuid calls). I don't think
>> we can use cpu_feature_enabled(X86_FEATURE_TDX_GUEST) in
>> compressed code right? Also is_tdx_guest() looks easy
>> to read and use.
> 
> OK, but how many of the is_tdx_guest() uses are in the compressed code?
>   Why has its use spread beyond that?
It's only used for handling in/out instructions in the compressed code. But this
code is shared with the in/out handling in non-compressed code.

#define __out(bwl, bw)                                                  \
do {                                                                    \
         if (is_tdx_guest()) {                                           \
                 asm volatile("call tdg_out" #bwl : :                    \
                                 "a"(value), "d"(port));                 \
         } else {                                                        \
                 asm volatile("out" #bwl " %" #bw "0, %w1" : :           \
                                 "a"(value), "Nd"(port));                \
         }                                                               \
} while (0)
#define __in(bwl, bw)                                                   \
do {                                                                    \
         if (is_tdx_guest()) {                                           \
                 asm volatile("call tdg_in" #bwl :                       \
                                 "=a"(value) : "d"(port));               \
         } else {                                                        \
                 asm volatile("in" #bwl " %w1, %" #bw "0" :              \
                                 "=a"(value) : "Nd"(port));              \
         }                                                               \
} while (0)

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO
  2021-04-01 19:56   ` Dave Hansen
@ 2021-04-01 22:26     ` Sean Christopherson
  2021-04-01 22:53       ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Sean Christopherson @ 2021-04-01 22:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

On Thu, Apr 01, 2021, Dave Hansen wrote:
> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
> > exit reason.
> > 
> > For now we only handle subset of instruction that kernel uses for MMIO
> > oerations. User-space access triggers SIGBUS.
> ..
> > +	case EXIT_REASON_EPT_VIOLATION:
> > +		ve->instr_len = tdx_handle_mmio(regs, ve);
> > +		break;
> 
> Is MMIO literally the only thing that can cause an EPT violation for TDX
> guests?

Any EPT Violation, or specifically EPT Violation #VE?  Any memory access can
cause an EPT violation, but the VMM will get the ones that lead to VM-Exit.  The
guest will only get the ones that cause #VE.

Assuming you're asking about #VE... No, any shared memory access can take a #VE
since the VMM controls the shared EPT tables and can clear the SUPPRESS_VE bit 
at any time.  But, if the VMM is friendly, #VE should be limited to MMIO.

There's also the unaccepted private memory case, but if Linux gets an option to
opt out of that, then #VE is limited to shared memory.

> Forget userspace for a minute.  #VE's from userspace are annoying, but
> fine.  We can't control what userspace does.  If an action it takes
> causes a #VE in the TDX architecture, tough cookies, the kernel must
> handle it and try to recover or kill the app.
> 
> The kernel is very different.  We know in advance (must know,
> actually...) which instructions might cause exceptions of any kind.
> That's why we have exception tables and copy_to/from_user().  That's why
> we can handle kernel page faults on userspace, but not inside spinlocks.
> 
> Binary-dependent OSes are also very different.  It's going to be natural
> for them to want to take existing, signed drivers and use them in TDX
> guests.  They might want to do something like this.
> 
> But for an OS where we have source for the *ENTIRE* thing, and where we
> have a chokepoint for MMIO accesses (arch/x86/include/asm/io.h), it
> seems like an *AWFUL* idea to:
> 1. Have the kernel set up special mappings for I/O memory
> 2. Kernel generates special instructions to access that memory
> 3. Kernel faults on that memory
> 4. Kernel cracks its own special instructions to see what they were
>    doing
> 5. Kernel calls up to host to do the MMIO
> 
> Instead of doing 2/3/4, why not just have #2 call up to the host
> directly?  This patch seems a very slow, roundabout way to do
> paravirtualized MMIO.
> 
> BTW, there's already some SEV special-casing in io.h.

I implemented #2 a while back for build_mmio_{read,write}(), I'm guessing the
code is floating around somewhere.  The gotcha is that there are nasty little
pieces of the kernel that don't use the helpers provided by io.h, e.g. the I/O
APIC code likes to access MMIO via a struct overlay, so the compiler is free to
use any instruction that satisfies the constraint.

The I/O APIC can and should be forced off, but dollars to donuts says there are
more special snowflakes lying in wait.  If the kernel uses an allowlist for
drivers, then in theory it should be possible to hunt down all offenders.  But
I think we'll want fallback logic to handle kernel MMIO #VEs, especially if the
kernel needs ISA cracking logic for userspace.  Without fallback logic, any MMIO
#VE from the kernel would be fatal, which is too harsh IMO since the behavior
isn't so obviously wrong, e.g. versus the split lock #AC purge where there's no
legitimate reason for the kernel to generate a split lock.
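
For reference, collapsing the fault-and-decode path into a direct host call
in the accessors amounts to something like this user-space sketch (the
hypercall is faked with a table here; all names are hypothetical):

```c
#include <stdint.h>

/*
 * Fake stand-in for a TDVMCALL asking the host to emulate an MMIO access.
 * In a real guest this would be a TDCALL, not a table lookup.
 */
static uint64_t host_regs[16];

static uint64_t tdvmcall_mmio(uint64_t gpa, int size, int write, uint64_t val)
{
	uint64_t *reg = &host_regs[(gpa & 0x7f) / 8];

	(void)size;
	if (write) {
		*reg = val;
		return 0;
	}
	return *reg;
}

/*
 * Paravirtualized accessors: call the host directly instead of executing
 * a load/store that faults with #VE and must then be cracked.
 */
static uint32_t tdx_mmio_read32(uint64_t gpa)
{
	return (uint32_t)tdvmcall_mmio(gpa, 4, 0, 0);
}

static void tdx_mmio_write32(uint64_t gpa, uint32_t val)
{
	tdvmcall_mmio(gpa, 4, 1, val);
}
```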

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO
  2021-04-01 22:26     ` Sean Christopherson
@ 2021-04-01 22:53       ` Dave Hansen
  0 siblings, 0 replies; 161+ messages in thread
From: Dave Hansen @ 2021-04-01 22:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

On 4/1/21 3:26 PM, Sean Christopherson wrote:
> On Thu, Apr 01, 2021, Dave Hansen wrote:
>> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>
>>> Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
>>> exit reason.
>>>
>>> For now we only handle a subset of the instructions that the kernel uses
>>> for MMIO operations. User-space access triggers SIGBUS.
>> ..
>>> +	case EXIT_REASON_EPT_VIOLATION:
>>> +		ve->instr_len = tdx_handle_mmio(regs, ve);
>>> +		break;
>>
>> Is MMIO literally the only thing that can cause an EPT violation for TDX
>> guests?
> 
> Any EPT Violation, or specifically EPT Violation #VE?  Any memory access can
> cause an EPT violation, but the VMM will get the ones that lead to VM-Exit.  The
> guest will only get the ones that cause #VE.

I'll rephrase: Is MMIO literally the only thing that can cause us to get
into the EXIT_REASON_EPT_VIOLATION case of the switch() here?

> Assuming you're asking about #VE... No, any shared memory access can take a #VE
> since the VMM controls the shared EPT tables and can clear the SUPPRESS_VE bit 
> at any time.  But, if the VMM is friendly, #VE should be limited to MMIO.

OK, but what are we doing in the case of unfriendly VMMs?  What does
*this* code do as-is, and where do we want to take it?

From the _looks_ of this patch, tdx_handle_mmio() is the be-all, end-all
solution to all EXIT_REASON_EPT_VIOLATION events.

>> But for an OS where we have source for the *ENTIRE* thing, and where we
>> have a chokepoint for MMIO accesses (arch/x86/include/asm/io.h), it
>> seems like an *AWFUL* idea to:
>> 1. Have the kernel set up special mappings for I/O memory
>> 2. Kernel generates special instructions to access that memory
>> 3. Kernel faults on that memory
>> 4. Kernel cracks its own special instructions to see what they were
>>    doing
>> 5. Kernel calls up to host to do the MMIO
>>
>> Instead of doing 2/3/4, why not just have #2 call up to the host
>> directly?  This patch seems a very slow, roundabout way to do
>> paravirtualized MMIO.
>>
>> BTW, there's already some SEV special-casing in io.h.
> 
> I implemented #2 a while back for build_mmio_{read,write}(), I'm guessing the
> code is floating around somewhere.  The gotcha is that there are nasty little
> pieces of the kernel that don't use the helpers provided by io.h, e.g. the I/O
> APIC code likes to access MMIO via a struct overlay, so the compiler is free to
> use any instruction that satisfies the constraint.

So, there aren't an infinite number of these.  It's also 100% possible
to add some tooling to the kernel today to help you find these.  You
could also have added tooling to KVM hosts to help find these.

Folks are *also* saying that we'll need a driver audit just to trust
that drivers aren't vulnerable to attacks from devices or from the host.
This can quite easily be a part of that effort.

> The I/O APIC can and should be forced off, but dollars to donuts says there are
> more special snowflakes lying in wait.  If the kernel uses an allowlist for
> drivers, then in theory it should be possible to hunt down all offenders.  But
> I think we'll want fallback logic to handle kernel MMIO #VEs, especially if the
> kernel needs ISA cracking logic for userspace.  Without fallback logic, any MMIO
> #VE from the kernel would be fatal, which is too harsh IMO since the behavior
> isn't so obviously wrong, e.g. versus the split lock #AC purge where there's no
> legitimate reason for the kernel to generate a split lock.

I'll buy that this patch is convenient for *debugging*.  It helped folks
bootstrap the TDX support and get it going.

IMNHO, if a driver causes a #VE, it's a bug.  Just like if it goes off
the rails and touches bad memory and #GP's or #PF's.

Are there any printk's in the #VE handler?  Guess what those do.  Print
to the console.  Guess what consoles do.  MMIO.  You can't get away from
doing audits of the console drivers.  Sure, you can go make #VE special,
like NMIs, but that's not going to be fun.  At least the guest doesn't
have to deal with the fatality of a nested #VE, but it's still fatal.

I just don't like us pretending that we're Windows and have no control
over the code we run and throwing up our hands.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 00/26] Add TDX Guest Support
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (29 preceding siblings ...)
  2021-03-31 21:38 ` Kuppuswamy, Sathyanarayanan
@ 2021-04-02  0:02 ` Dave Hansen
  2021-04-02  2:48   ` Andi Kleen
  2021-04-04 15:02 ` Dave Hansen
  31 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-02  0:02 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
> hosts and some physical attacks. This series adds the bare-minimum
> support to run a TDX guest. The host-side support will be submitted
> separately. Also support for advanced TD guest features like attestation
> or debug-mode will be submitted separately. Also, at this point it is not
> secure with some known holes in drivers, and also hasn’t been fully audited
> and fuzzed yet.

I want to hear a lot more about this driver model.

I've heard things like "we need to harden the drivers" or "we need to do
audits" and that drivers might be "whitelisted".

What are we talking about specifically?  Which drivers?  How many
approximately?  Just virtio?  Are there any "real" hardware drivers
involved like how QEMU emulates an e1000 or rtl8139 device?  What about
the APIC or HPET?

How broadly across the kernel is this going to go?

Without something concrete, it's really hard to figure out if we should
go full-blown paravirtualized MMIO, or do something like the #VE
trapping that's in this series currently.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 00/26] Add TDX Guest Support
  2021-04-02  0:02 ` Dave Hansen
@ 2021-04-02  2:48   ` Andi Kleen
  2021-04-02 15:27     ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Andi Kleen @ 2021-04-02  2:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel

> I've heard things like "we need to harden the drivers" or "we need to do
> audits" and that drivers might be "whitelisted".

The basic driver allow listing patches are already in the repository,
but not currently posted or complete:

https://github.com/intel/tdx/commits/guest

> 
> What are we talking about specifically?  Which drivers?  How many
> approximately?  Just virtio?  

Right now just virtio, later other drivers that hypervisors need.

> Are there any "real" hardware drivers
> involved like how QEMU emulates an e1000 or rtl8139 device?  

Not currently (but some later hypervisor might rely on one of those)

> What about
> the APIC or HPET?

No IO-APIC, but the local APIC. No HPET.

> 
> How broadly across the kernel is this going to go?

Not very broadly for drivers.

> 
> Without something concrete, it's really hard to figure out if we should
> go full-blown paravirtualized MMIO, or do something like the #VE
> trapping that's in this series currently.

As Sean says the concern about MMIO is less drivers (which should
be generally ok if they work on other architectures which require MMIO
magic), but other odd code that only ran on x86 before.

I really don't understand your crusade against #VE. It really
isn't that bad if we can avoid the few corner cases.

For me it would seem wrong to force all MMIO for all drivers into some
complicated paravirt construct, blowing up code size everywhere
and adding complicated self-modifying code, when it's only needed for very
few drivers. But we also don't want to patch every MMIO access to be
special-cased even in those few drivers.

#VE based MMIO avoids all that cleanly while being nicely non-intrusive.
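
Roughly, the #VE dispatch being defended here has this shape (user-space
model; the names, constants, and fixed instruction lengths are illustrative,
and the real handler gets its arguments from a TDCALL):

```c
#include <stdint.h>

struct ve_info {
	uint32_t exit_reason;
	uint64_t gpa;		/* faulting guest physical address */
	uint32_t instr_len;	/* bytes to skip on return */
};

enum { EXIT_REASON_CPUID = 10, EXIT_REASON_EPT_VIOLATION = 48 };

/* Pretend every MMIO instruction we emulate is 3 bytes long. */
static uint32_t tdx_handle_mmio(struct ve_info *ve)
{
	(void)ve->gpa;		/* a real handler decodes and emulates here */
	return 3;
}

/* Returns the instruction length to skip, or 0 for an unhandled #VE. */
static uint32_t tdx_handle_virtualization_exception(struct ve_info *ve)
{
	switch (ve->exit_reason) {
	case EXIT_REASON_EPT_VIOLATION:
		ve->instr_len = tdx_handle_mmio(ve);
		return ve->instr_len;
	case EXIT_REASON_CPUID:
		return 2;	/* cpuid is a fixed 2-byte instruction */
	default:
		return 0;	/* unknown: caller panics or sends SIGBUS */
	}
}
```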

-Andi


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 00/26] Add TDX Guest Support
  2021-04-02  2:48   ` Andi Kleen
@ 2021-04-02 15:27     ` Dave Hansen
  2021-04-02 21:32       ` Andi Kleen
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-02 15:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/1/21 7:48 PM, Andi Kleen wrote:
>> I've heard things like "we need to harden the drivers" or "we need to do
>> audits" and that drivers might be "whitelisted".
> 
> The basic driver allow listing patches are already in the repository,
> but not currently posted or complete:
> 
> https://github.com/intel/tdx/commits/guest

That lists exactly 8 ids:

> 	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1000 }, /* Virtio NET */
> 	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1001 }, /* Virtio block */
> 	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1003 }, /* Virtio console */
> 	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1009 }, /* Virtio FS */
> 
> 	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041 }, /* Virtio 1.0 NET */
> 	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042 }, /* Virtio 1.0 block */
> 	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1043 }, /* Virtio 1.0 console */
> 	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1049 }, /* Virtio 1.0 FS */

How many places do those 8 drivers touch MMIO?
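
For reference, the filter those eight IDs feed is presumably little more than
a table walk like this (sketch; the actual patches live in the repository
linked above):

```c
#include <stdint.h>

#define PCI_VENDOR_ID_REDHAT_QUMRANET 0x1af4

struct pci_id { uint16_t vendor; uint16_t device; };

static const struct pci_id tdx_allowed_devices[] = {
	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1000 }, /* Virtio NET */
	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1001 }, /* Virtio block */
	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1003 }, /* Virtio console */
	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1009 }, /* Virtio FS */
	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1041 }, /* Virtio 1.0 NET */
	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1042 }, /* Virtio 1.0 block */
	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1043 }, /* Virtio 1.0 console */
	{ PCI_VENDOR_ID_REDHAT_QUMRANET, 0x1049 }, /* Virtio 1.0 FS */
};

/* Returns nonzero if the device is on the allowlist and may be probed. */
static int tdx_dev_allowed(uint16_t vendor, uint16_t device)
{
	unsigned int i;

	for (i = 0; i < sizeof(tdx_allowed_devices) /
			sizeof(tdx_allowed_devices[0]); i++) {
		if (tdx_allowed_devices[i].vendor == vendor &&
		    tdx_allowed_devices[i].device == device)
			return 1;
	}
	return 0;
}
```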

>> Are there any "real" hardware drivers
>> involved like how QEMU emulates an e1000 or rtl8139 device?  
> 
> Not currently (but some later hypervisor might rely on one of those)
> 
>> What about
>> the APIC or HPET?
> 
> No IO-APIC, but the local APIC. No HPET.

Sean seemed worried about other x86-specific oddities.  Are there any
more, or is the local APIC the only non-driver MMIO?

>> Without something concrete, it's really hard to figure out if we should
>> go full-blown paravirtualized MMIO, or do something like the #VE
>> trapping that's in this series currently.
> 
> As Sean says the concern about MMIO is less drivers (which should
> be generally ok if they work on other architectures which require MMIO
> magic), but other odd code that only ran on x86 before.
> 
> I really don't understand your crusade against #VE. It really
> isn't that bad if we can avoid the few corner cases.

The problem isn't with #VE per se.  It's with posting a series that
masquerades as a full solution while *NOT* covering or even enumerating
the corner cases.  That's exactly what happened with #VE to start with:
it was implemented in a way that exposed the kernel to #VE during the
syscall gap (and the SWAPGS gap for that matter).

So, I'm pushing for a design that won't have corner cases.  If MMIO
itself is disallowed, then we can scream about *any* detected MMIO.
Then, there's no worry about #VE nesting.  No #VE, no #VE nesting.  We
don't even have to consider if #VE needs NMI-like semantics.

> For me it would seem wrong to force all MMIO for all drivers into some
> complicated paravirt construct, blowing up code size everywhere
> and adding complicated self-modifying code, when it's only needed for very
> few drivers. But we also don't want to patch every MMIO access to be
> special-cased even in those few drivers.
> 
> #VE based MMIO avoids all that cleanly while being nicely non-intrusive.

But, we're not selling used cars here.  Using #VE has downsides.
Let's not pretend that it doesn't.

If we go this route, what are the rules and restrictions?  Do we have to
say "no MMIO in #VE"?

I'm really the most worried about the console.  Consoles and NMIs have
been a nightmare, IIRC.  Doesn't this just make it *WORSE* because now
the deepest reaches of the console driver are guaranteed to #VE?

Which brings up another related point: How do you debug TD guests?  Does
earlyprintk work?

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 00/26] Add TDX Guest Support
  2021-04-02 15:27     ` Dave Hansen
@ 2021-04-02 21:32       ` Andi Kleen
  2021-04-03 16:26         ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Andi Kleen @ 2021-04-02 21:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel

> If we go this route, what are the rules and restrictions?  Do we have to
> say "no MMIO in #VE"?

All we have to say is "No MMIO in #VE before getting the TDVEINFO arguments".
After that it can nest without problems.

If you nest before that, TDX will cause a triple fault.

The code that cannot do it is a few lines in the early handler which
runs with interrupts off.

The TDX module also makes sure to not inject NMIs while we're in
that region, so NMIs are of no concern.

That was the whole point of avoiding the system call gap problem. We don't
need to make it IST, so it can nest.

I'm not aware of any other special rules.

> Which brings up another related point: How do you debug TD guests?  Does
> earlyprintk work?

Today it works actually because serial ports are allowed. But I expect it to
be closed eventually because serial code is a lot of code to audit. 
But you can always disable the filtering with a command line option and
then it will always work for debugging.

-Andi

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 00/26] Add TDX Guest Support
  2021-04-02 21:32       ` Andi Kleen
@ 2021-04-03 16:26         ` Dave Hansen
  2021-04-03 17:28           ` Andi Kleen
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-03 16:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/2/21 2:32 PM, Andi Kleen wrote:
>> If we go this route, what are the rules and restrictions?  Do we have to
>> say "no MMIO in #VE"?
> 
> All we have to say is "No MMIO in #VE before getting thd TDVEINFO arguments"
> After that it can nest without problems.

Well, not exactly.  You still can't do things that could cause an
unbounded recursive #VE.

It doesn't seem *that* far fetched to think that someone might try to
defer some work or dump data to the console.

> If you nest before that the TDX will cause a triple fault.
> 
> The code that cannot do it is a few lines in the early handler which
> runs with interrupts off.

>> Which brings up another related point: How do you debug TD guests?  Does
>> earlyprintk work?
> 
> Today it works actually because serial ports are allowed. But I expect it to
> be closed eventually because serial code is a lot of code to audit. 
> But you can always disable the filtering with a command line option and
> then it will always work for debugging.

Do we need a TDX-specific earlyprintk?  I would imagine it's pretty easy
to implement.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 00/26] Add TDX Guest Support
  2021-04-03 16:26         ` Dave Hansen
@ 2021-04-03 17:28           ` Andi Kleen
  0 siblings, 0 replies; 161+ messages in thread
From: Andi Kleen @ 2021-04-03 17:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel

On Sat, Apr 03, 2021 at 09:26:24AM -0700, Dave Hansen wrote:
> On 4/2/21 2:32 PM, Andi Kleen wrote:
> >> If we go this route, what are the rules and restrictions?  Do we have to
> >> say "no MMIO in #VE"?
> > 
> > All we have to say is "No MMIO in #VE before getting the TDVEINFO arguments"
> > After that it can nest without problems.
> 
> Well, not exactly.  You still can't do things that could cause an
> unbounded recursive #VE.

> It doesn't seem *that* far fetched to think that someone might try to
> defer some work or dump data to the console.

I believe the main console code has reentry protection.

I'm not sure about early_printk (with keep), but if that's the case
it probably should be fixed anyway. I can take a look at that.

Not sure why deferring something would cause another #VE?

 
> > If you nest before that the TDX will cause a triple fault.
> > 
> > The code that cannot do it is a few lines in the early handler which
> > runs with interrupts off.
> 
> >> Which brings up another related point: How do you debug TD guests?  Does
> >> earlyprintk work?
> > 
> > Today it works actually because serial ports are allowed. But I expect it to
> > be closed eventually because serial code is a lot of code to audit. 
> > But you can always disable the filtering with a command line option and
> > then it will always work for debugging.
> 
> Do we need a TDX-specific earlyprintk?  I would imagine it's pretty easy
> to implement.

Don't see a need at this point, the existing mechanisms work.

Maybe if we ever have a problem that only happens in lockdown *and* happens
early, but that's not very likely since lockdown primarily changes code
behavior later.

There are also other debug mechanisms for such cases: in TDX, if you
configure the TD for debug mode, you can use the gdb stub on the hypervisor.

-Andi


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 00/26] Add TDX Guest Support
  2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
                   ` (30 preceding siblings ...)
  2021-04-02  0:02 ` Dave Hansen
@ 2021-04-04 15:02 ` Dave Hansen
  2021-04-12 17:24   ` Dan Williams
  31 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-04 15:02 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

It occurred to me that I've been doing a lot of digging in the TDX spec
lately.  I think we can all agree that the "Architecture Specification"
is not the world's easiest, most digestible reading.  It's hard to
figure out how Linux relates to the spec.

One bit of Documentation we need for TDX is a description of the memory
states.  For instance, it would be nice to spell out the different
classes of memory, how they are selected, who selects them, and who
enforces the selection.  What faults are generated on each type and who
can induce those?

For instance:

TD-Private memory is selected by the Shared/Private bit in Present=1
guest PTEs.  When the hardware page walker sees that bit, it walks the
secure EPT.  The secure EPT entries can only be written by the TDX
module, although they are written at the request of the VMM.  The TDX
module enforces rules like ensuring that the memory mapped by secure EPT
is not mapped multiple times.  The VMM can remove entries.  From the
guest perspective, all private memory accesses are either successful, or
result in a #VE.  Private memory access does not cause VMExits.
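
Spelled out as code, the selection rule described above is just this (sketch;
the top-of-GPA bit position is the usual assumption based on gpa_width):

```c
#include <stdint.h>

/*
 * The Shared/Private selector is the top GPA bit, i.e. bit (gpa_width - 1)
 * as reported to the guest.  The exact position here is illustrative.
 */
static uint64_t tdx_shared_mask(unsigned int gpa_width)
{
	return 1ULL << (gpa_width - 1);
}

/* Private GPAs walk the secure EPT; shared GPAs walk the VMM-controlled EPT. */
static int gpa_is_private(uint64_t gpa, unsigned int gpa_width)
{
	return !(gpa & tdx_shared_mask(gpa_width));
}
```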

Would that be useful to folks?

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code
  2021-04-01 20:06   ` Dave Hansen
@ 2021-04-06 15:37     ` Kirill A. Shutemov
  2021-04-06 16:11       ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kirill A. Shutemov @ 2021-04-06 15:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On Thu, Apr 01, 2021 at 01:06:29PM -0700, Dave Hansen wrote:
> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Intel TDX doesn't allow VMM to access guest memory. Any memory that is
> > required for communication with VMM suppose to be shared explicitly by
> 
> s/suppose to/must/

Right.

> > setting the bit in page table entry. The shared memory is similar to
> > unencrypted memory in AMD SME/SEV terminology.
> 
> In addition to setting the page table bit, there's also a dance to go
> through to convert the memory.  Please mention the procedure here at
> least.  It's very different from SME.

"
  After setting the shared bit, the conversion must be completed with the
  MapGPA TDVMCALL. The call informs the VMM about the conversion and makes
  it remove the GPA from the S-EPT mapping.
"

> > force_dma_unencrypted() has to return true for TDX guest. Move it out of
> > AMD SME code.
> 
> You lost me here.  What does force_dma_unencrypted() have to do with
> host/guest shared memory?

"
  AMD SEV makes force_dma_unencrypted() return true which triggers
  set_memory_decrypted() calls on all DMA allocations. TDX will use the
  same code path to make DMA allocations shared.
"
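
In other words, the shared code path boils down to one predicate that both
guest types answer the same way (user-space sketch, hypothetical names):

```c
#include <stdbool.h>

enum guest_type { GUEST_NATIVE, GUEST_SEV, GUEST_TDX };

/*
 * Both SEV and TDX guests must make DMA buffers accessible to the host:
 * "decrypted" in SEV terms, "shared" in TDX terms.  The DMA core asks
 * only this one question and then calls set_memory_decrypted().
 */
static bool force_dma_unencrypted(enum guest_type guest)
{
	return guest == GUEST_SEV || guest == GUEST_TDX;
}
```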

> > Introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
> > selected by all x86 memory encryption features.
> 
> Please also mention what will set it.  I assume TDX guest support will
> set this option.  It's probably also worth a sentence to say that
> force_dma_unencrypted() will have TDX-specific code added to it.  (It
> will, right??)

"
  Only AMD_MEM_ENCRYPT uses the option now. TDX will be the second one.
"

> > This is preparation for TDX changes in DMA code.
> 
> Probably best to also mention that this effectively just moves code
> around.  This patch should have no functional changes at runtime.

Isn't that what the subject says? :P

> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 0374d9f262a5..8fa654d61ac2 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1538,14 +1538,18 @@ config X86_CPA_STATISTICS
> >  	  helps to determine the effectiveness of preserving large and huge
> >  	  page mappings when mapping protections are changed.
> >  
> > +config X86_MEM_ENCRYPT_COMMON
> > +	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
> > +	select DYNAMIC_PHYSICAL_MASK
> > +	def_bool n
> > +
> >  config AMD_MEM_ENCRYPT
> >  	bool "AMD Secure Memory Encryption (SME) support"
> >  	depends on X86_64 && CPU_SUP_AMD
> >  	select DMA_COHERENT_POOL
> > -	select DYNAMIC_PHYSICAL_MASK
> >  	select ARCH_USE_MEMREMAP_PROT
> > -	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
> >  	select INSTRUCTION_DECODER
> > +	select X86_MEM_ENCRYPT_COMMON
> >  	help
> >  	  Say yes to enable support for the encryption of system memory.
> >  	  This requires an AMD processor that supports Secure Memory
> > diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> > index 30a3b30395ad..95e534cffa99 100644
> > --- a/arch/x86/include/asm/io.h
> > +++ b/arch/x86/include/asm/io.h
> > @@ -257,10 +257,12 @@ static inline void slow_down_io(void)
> >  
> >  #endif
> >  
> > -#ifdef CONFIG_AMD_MEM_ENCRYPT
> >  #include <linux/jump_label.h>
> >  
> >  extern struct static_key_false sev_enable_key;
> 
> This _looks_ odd.  sev_enable_key went from being under
> CONFIG_AMD_MEM_ENCRYPT to being unconditionally referenced.

Not referenced, but declared.

> Could you explain a bit more?
> 
> I would have expected it to at *least* be tied to the new #ifdef.

Looks like a fixup got folded into the wrong patch. It was supposed to be
in "x86/kvm: Use bounce buffers for TD guest".

This declaration allows us to get away without any #ifdefs in
mem_encrypt_init() when !CONFIG_AMD_MEM_ENCRYPT: sev_active() is
false at compile-time and sev_enable_key is never referenced.

Sathya, could you move it to the right patch?

> > +#ifdef CONFIG_AMD_MEM_ENCRYPT
> > +
> >  static inline bool sev_key_active(void)
> >  {
> >  	return static_branch_unlikely(&sev_enable_key);
> > diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> > index 5864219221ca..b31cb52bf1bd 100644
> > --- a/arch/x86/mm/Makefile
> ...

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 22/26] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-04-01 20:13   ` Dave Hansen
@ 2021-04-06 15:54     ` Kirill A. Shutemov
  2021-04-06 16:12       ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kirill A. Shutemov @ 2021-04-06 15:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On Thu, Apr 01, 2021 at 01:13:16PM -0700, Dave Hansen wrote:
> > @@ -56,6 +61,9 @@ static void tdx_get_info(void)
> >  
> >  	td_info.gpa_width = rcx & GENMASK(5, 0);
> >  	td_info.attributes = rdx;
> > +
> > +	/* Exclude Shared bit from the __PHYSICAL_MASK */
> > +	physical_mask &= ~tdx_shared_mask();
> >  }
> 
> I wish we had all of these 'physical_mask' manipulations in a single
> spot.  Can we consolidate these instead of having TDX and SME poke at
> them individually?

SME has to do it very early -- from __startup_64() -- as it sets the bit
on all memory, except what is used for communication. TDX can postpone it,
as we don't need any shared mappings in very early boot.

Basically, to do it from the same place we would need to move TDX
enumeration earlier into boot. It's risky: everything is more fragile
there.

I would rather keep it as is. We should be fine as long as we only allow
clearing bits from the mask.
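
The only-ever-clear-bits invariant is small enough to show in full
(user-space sketch; the gpa_width value is illustrative):

```c
#include <stdint.h>

/* 52 physical address bits to start with, as on current hardware. */
static uint64_t physical_mask = (1ULL << 52) - 1;

static void tdx_filter_shared_bit(unsigned int gpa_width)
{
	/* The shared bit is the top GPA bit; it is not a real address bit. */
	physical_mask &= ~(1ULL << (gpa_width - 1));
}

/* PTE-to-physical extraction then never sees the shared bit. */
static uint64_t pte_to_phys(uint64_t pte)
{
	return pte & physical_mask & ~0xfffULL;
}
```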

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 23/26] x86/tdx: Make pages shared in ioremap()
  2021-04-01 20:26   ` Dave Hansen
@ 2021-04-06 16:00     ` Kirill A. Shutemov
  2021-04-06 16:14       ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kirill A. Shutemov @ 2021-04-06 16:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On Thu, Apr 01, 2021 at 01:26:23PM -0700, Dave Hansen wrote:
> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > All ioremap()ed paged that are not backed by normal memory (NONE or
> > RESERVED) have to be mapped as shared.
> 
> s/paged/pages/
> 
> 
> > +/* Make the page accessible by the VMM */
> > +#define pgprot_tdx_shared(prot) __pgprot(pgprot_val(prot) | tdx_shared_mask())
> > +
> >  #ifndef __ASSEMBLY__
> >  #include <asm/x86_init.h>
> >  #include <asm/fpu/xstate.h>
> > diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> > index 9e5ccc56f8e0..a0ba760866d4 100644
> > --- a/arch/x86/mm/ioremap.c
> > +++ b/arch/x86/mm/ioremap.c
> > @@ -87,12 +87,12 @@ static unsigned int __ioremap_check_ram(struct resource *res)
> >  }
> >  
> >  /*
> > - * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
> > - * there the whole memory is already encrypted.
> > + * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
> > + * private in TDX case) because there the whole memory is already encrypted.
> >   */
> 
> But doesn't this mean that we can't ioremap() normal memory?

It's not allowed anyway: see the (io_desc.flags & IORES_MAP_SYSTEM_RAM)
check in __ioremap_caller().


> I was somehow expecting that we would need to do this for some
> host<->guest communication pages.

It goes through the DMA API, not ioremap().

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code
  2021-04-06 15:37     ` Kirill A. Shutemov
@ 2021-04-06 16:11       ` Dave Hansen
  2021-04-06 16:37         ` Kirill A. Shutemov
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-06 16:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 4/6/21 8:37 AM, Kirill A. Shutemov wrote:
> On Thu, Apr 01, 2021 at 01:06:29PM -0700, Dave Hansen wrote:
>> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>
>>> Intel TDX doesn't allow VMM to access guest memory. Any memory that is
>>> required for communication with VMM suppose to be shared explicitly by
>>
>> s/suppose to/must/
> 
> Right.
> 
>>> setting the bit in page table entry. The shared memory is similar to
>>> unencrypted memory in AMD SME/SEV terminology.
>>
>> In addition to setting the page table bit, there's also a dance to go
>> through to convert the memory.  Please mention the procedure here at
>> least.  It's very different from SME.
> 
> "
>   After setting the shared bit, the conversion must be completed with the
>   MapGPA TDVMCALL. The call informs the VMM about the conversion and makes
>   it remove the GPA from the S-EPT mapping.
> "

Where does the TDX module fit in here?

>>> force_dma_unencrypted() has to return true for TDX guest. Move it out of
>>> AMD SME code.
>>
>> You lost me here.  What does force_dma_unencrypted() have to do with
>> host/guest shared memory?
> 
> "
>   AMD SEV makes force_dma_unencrypted() return true which triggers
>   set_memory_decrypted() calls on all DMA allocations. TDX will use the
>   same code path to make DMA allocations shared.
> "

SEV assumes that I/O devices can only do DMA to "decrypted" physical
addresses without the C-bit set.  In order for the CPU to interact with
this memory, the CPU needs a decrypted mapping.

TDX is similar.  TDX architecturally prevents access to private guest
memory by anything other than the guest itself.  This means that any DMA
buffers must be shared.

Right?

>>> Introduce new config option X86_MEM_ENCRYPT_COMMON that has to be
>>> selected by all x86 memory encryption features.
>>
>> Please also mention what will set it.  I assume TDX guest support will
>> set this option.  It's probably also worth a sentence to say that
>> force_dma_unencrypted() will have TDX-specific code added to it.  (It
>> will, right??)
> 
> "
>   Only AMD_MEM_ENCRYPT uses the option now. TDX will be the second one.
> "
> 
>>> This is preparation for TDX changes in DMA code.
>>
>> Probably best to also mention that this effectively just moves code
>> around.  This patch should have no functional changes at runtime.
> 
> Isn't that what the subject says? :P

Yes, but please mention it explicitly.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 22/26] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-04-06 15:54     ` Kirill A. Shutemov
@ 2021-04-06 16:12       ` Dave Hansen
  0 siblings, 0 replies; 161+ messages in thread
From: Dave Hansen @ 2021-04-06 16:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 4/6/21 8:54 AM, Kirill A. Shutemov wrote:
> On Thu, Apr 01, 2021 at 01:13:16PM -0700, Dave Hansen wrote:
>>> @@ -56,6 +61,9 @@ static void tdx_get_info(void)
>>>  
>>>  	td_info.gpa_width = rcx & GENMASK(5, 0);
>>>  	td_info.attributes = rdx;
>>> +
>>> +	/* Exclude Shared bit from the __PHYSICAL_MASK */
>>> +	physical_mask &= ~tdx_shared_mask();
>>>  }
>> I wish we had all of these 'physical_mask' manipulations in a single
>> spot.  Can we consolidate these instead of having TDX and SME poke at
>> them individually?
> SME has to do it very early -- from __startup_64() -- as it sets the bit
> on all memory, except what used for communication. TDX can postpone as we
> don't need any shared mapping in very early boot.
> 
> Basically, to make it done from the same place we would need to move TDX
> enumeration earlier into boot. It's risky: everything is more fragile
> there.
> 
> I would rather keep it as is. We should be fine as long as we only allow
> to clear bits from the mask.

I'll buy that.  Could you mention it in the changelog, please?

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 23/26] x86/tdx: Make pages shared in ioremap()
  2021-04-06 16:00     ` Kirill A. Shutemov
@ 2021-04-06 16:14       ` Dave Hansen
  0 siblings, 0 replies; 161+ messages in thread
From: Dave Hansen @ 2021-04-06 16:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 4/6/21 9:00 AM, Kirill A. Shutemov wrote:
>>> --- a/arch/x86/mm/ioremap.c
>>> +++ b/arch/x86/mm/ioremap.c
>>> @@ -87,12 +87,12 @@ static unsigned int __ioremap_check_ram(struct resource *res)
>>>  }
>>>  
>>>  /*
>>> - * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
>>> - * there the whole memory is already encrypted.
>>> + * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
>>> + * private in TDX case) because there the whole memory is already encrypted.
>>>   */
>> But doesn't this mean that we can't ioremap() normal memory?
> It's not allowed anyway: see (io_desc.flags & IORES_MAP_SYSTEM_RAM) in the
> __ioremap_caller().
> 
>> I was somehow expecting that we would need to do this for some
>> host<->guest communication pages.
> It goes though DMA API, not ioremap().

Ahh, got it.  Thanks for the clarification.

It would help to make mention of that stuff in the changelog to make it
more obvious going forward.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 25/26] x86/tdx: Make DMA pages shared
  2021-04-01 21:01   ` Dave Hansen
@ 2021-04-06 16:31     ` Kirill A. Shutemov
  2021-04-06 16:38       ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kirill A. Shutemov @ 2021-04-06 16:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kai Huang, Sean Christopherson

On Thu, Apr 01, 2021 at 02:01:15PM -0700, Dave Hansen wrote:
> > +int tdx_map_gpa(phys_addr_t gpa, int numpages, bool private)
> > +{
> > +	int ret, i;
> > +
> > +	ret = __tdx_map_gpa(gpa, numpages, private);
> > +	if (ret || !private)
> > +		return ret;
> > +
> > +	for (i = 0; i < numpages; i++)
> > +		tdx_accept_page(gpa + i*PAGE_SIZE);
> > +
> > +	return 0;
> > +}
> 
> Please do something like this:
> 
> enum tdx_map_type {
> 	TDX_MAP_PRIVATE,
> 	TDX_MAP_SHARED
> };
> 
> Then, your calls will look like:
> 
> 	tdx_map_gpa(gpa, nr, TDX_MAP_SHARED);
> 
> instead of:
> 
> 	tdx_map_gpa(gpa, nr, false);

Okay, makes sense.

> >  static __cpuidle void tdx_halt(void)
> >  {
> >  	register long r10 asm("r10") = TDVMCALL_STANDARD;
> > diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
> > index 964e04152417..b6d93b0c5dcf 100644
> > --- a/arch/x86/mm/mem_encrypt_common.c
> > +++ b/arch/x86/mm/mem_encrypt_common.c
> > @@ -15,9 +15,9 @@
> >  bool force_dma_unencrypted(struct device *dev)
> >  {
> >  	/*
> > -	 * For SEV, all DMA must be to unencrypted/shared addresses.
> > +	 * For SEV and TDX, all DMA must be to unencrypted/shared addresses.
> >  	 */
> > -	if (sev_active())
> > +	if (sev_active() || is_tdx_guest())
> >  		return true;
> >  
> >  	/*
> > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > index 16f878c26667..6f23a9816ef0 100644
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -27,6 +27,7 @@
> >  #include <asm/proto.h>
> >  #include <asm/memtype.h>
> >  #include <asm/set_memory.h>
> > +#include <asm/tdx.h>
> >  
> >  #include "../mm_internal.h"
> >  
> > @@ -1977,8 +1978,8 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> >  	struct cpa_data cpa;
> >  	int ret;
> >  
> > -	/* Nothing to do if memory encryption is not active */
> > -	if (!mem_encrypt_active())
> > +	/* Nothing to do if memory encryption and TDX are not active */
> > +	if (!mem_encrypt_active() && !is_tdx_guest())
> >  		return 0;
> 
> So, this is starting to look like the "enc" naming is wrong, or at least
> a little misleading.   Should we be talking about "protection" or
> "guards" or something?

Are you talking about the function argument or function name too?

> >  	/* Should not be working on unaligned addresses */
> > @@ -1988,8 +1989,14 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> >  	memset(&cpa, 0, sizeof(cpa));
> >  	cpa.vaddr = &addr;
> >  	cpa.numpages = numpages;
> > -	cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
> > -	cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
> > +	if (is_tdx_guest()) {
> > +		cpa.mask_set = __pgprot(enc ? 0 : tdx_shared_mask());
> > +		cpa.mask_clr = __pgprot(enc ? tdx_shared_mask() : 0);
> > +	} else {
> > +		cpa.mask_set = __pgprot(enc ? _PAGE_ENC : 0);
> > +		cpa.mask_clr = __pgprot(enc ? 0 : _PAGE_ENC);
> > +	}
> 
> OK, this is too hideous to live.  It sucks that the TDX and SEV/SME bits
> are opposite polarity, but oh well.
> 
> To me, this gets a lot clearer, and opens up room for commenting if you
> do something like:
> 
> 	if (is_tdx_guest()) {
> 		mem_enc_bits   = 0;
> 		mem_plain_bits = tdx_shared_mask();
> 	} else {
> 		mem_enc_bits   = _PAGE_ENC;
> > 		mem_plain_bits = 0;
> 	}
> 
> 	if (enc) {
> 		cpa.mask_set = mem_enc_bits;
> 		cpa.mask_clr = mem_plain_bits;  // clear "plain" bits
> 	} else {
> 		
> 		cpa.mask_set = mem_plain_bits;
> 		cpa.mask_clr = mem_enc_bits;	// clear encryption bits
> 	}

I'm not convinced that your approach is clearer. If you add the missing
__pgprot() it's going to be as ugly as the original.

But if a maintainer wants... :)

> >  	cpa.pgd = init_mm.pgd;
> >  
> >  	/* Must avoid aliasing mappings in the highmem code */
> > @@ -1999,7 +2006,8 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> >  	/*
> >  	 * Before changing the encryption attribute, we need to flush caches.
> >  	 */
> > -	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
> > +	if (!enc || !is_tdx_guest())
> > +		cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
> 
> That "!enc" looks wrong to me.  Caches would need to be flushed whenever
> encryption attributes *change*, not just when they are set.
> 
> Also, cpa_flush() flushes caches *AND* the TLB.  How does TDX manage to
> not need TLB flushes?

I will double-check everything, but I think we can skip *both* cpa_flush()
calls for private->shared conversion. The VMM and TDX module will take care
of TLB and cache flushing in response to the MapGPA TDVMCALL.

> >  	ret = __change_page_attr_set_clr(&cpa, 1);
> >  
> > @@ -2012,6 +2020,11 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> >  	 */
> >  	cpa_flush(&cpa, 0);
> >  
> > +	if (!ret && is_tdx_guest()) {
> > +		ret = tdx_map_gpa(__pa(addr), numpages, enc);
> > +		// XXX: need to undo on error?
> > +	}
> 
> Time to fix this stuff up if you want folks to take this series more
> seriously.

My bad, will fix it.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code
  2021-04-06 16:11       ` Dave Hansen
@ 2021-04-06 16:37         ` Kirill A. Shutemov
  0 siblings, 0 replies; 161+ messages in thread
From: Kirill A. Shutemov @ 2021-04-06 16:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On Tue, Apr 06, 2021 at 09:11:25AM -0700, Dave Hansen wrote:
> On 4/6/21 8:37 AM, Kirill A. Shutemov wrote:
> > On Thu, Apr 01, 2021 at 01:06:29PM -0700, Dave Hansen wrote:
> >> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
> >>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >>>
> >>> Intel TDX doesn't allow VMM to access guest memory. Any memory that is
> >>> required for communication with VMM suppose to be shared explicitly by
> >>
> >> s/suppose to/must/
> > 
> > Right.
> > 
> >>> setting the bit in page table entry. The shared memory is similar to
> >>> unencrypted memory in AMD SME/SEV terminology.
> >>
> >> In addition to setting the page table bit, there's also a dance to go
> >> through to convert the memory.  Please mention the procedure here at
> >> least.  It's very different from SME.
> > 
> > "
> >   After setting the shared bit, the conversion must be completed with
> >   MapGPA TDVMCALL. The call informs the VMM about the conversion and makes it
> >   remove the GPA from the S-EPT mapping.
> > "
> 
> Where does the TDX module fit in here?

The VMM must go through the TLB Tracking Sequence, which involves a bunch
of SEAMCALLs. See 3.3.1.2 "Dynamic Page Removal (Private to Shared
Conversion)" of the TDX Module spec.
> 
> >>> force_dma_unencrypted() has to return true for TDX guest. Move it out of
> >>> AMD SME code.
> >>
> >> You lost me here.  What does force_dma_unencrypted() have to do with
> >> host/guest shared memory?
> > 
> > "
> >   AMD SEV makes force_dma_unencrypted() return true which triggers
> >   set_memory_decrypted() calls on all DMA allocations. TDX will use the
> >   same code path to make DMA allocations shared.
> > "
> 
> SEV assumes that I/O devices can only do DMA to "decrypted" physical
> addresses without the C-bit set.  In order for the CPU to interact with
> this memory, the CPU needs a decrypted mapping.
> 
> TDX is similar.  TDX architecturally prevents access to private guest
> memory by anything other than the guest itself.  This means that any DMA
> buffers must be shared.
> 
> Right?

Yes.


-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 25/26] x86/tdx: Make DMA pages shared
  2021-04-06 16:31     ` Kirill A. Shutemov
@ 2021-04-06 16:38       ` Dave Hansen
  2021-04-06 17:16         ` Sean Christopherson
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-06 16:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Kai Huang, Sean Christopherson

On 4/6/21 9:31 AM, Kirill A. Shutemov wrote:
> On Thu, Apr 01, 2021 at 02:01:15PM -0700, Dave Hansen wrote:
>>> @@ -1977,8 +1978,8 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
>>>  	struct cpa_data cpa;
>>>  	int ret;
>>>  
>>> -	/* Nothing to do if memory encryption is not active */
>>> -	if (!mem_encrypt_active())
>>> +	/* Nothing to do if memory encryption and TDX are not active */
>>> +	if (!mem_encrypt_active() && !is_tdx_guest())
>>>  		return 0;
>>
>> So, this is starting to look like the "enc" naming is wrong, or at least
>> a little misleading.   Should we be talking about "protection" or
>> "guards" or something?
> 
> Are you talking about the function argument or function name too?

Yes, __set_memory_enc_dec() isn't really just doing "enc"ryption any more.

>>>  	/* Should not be working on unaligned addresses */
>>> @@ -1988,8 +1989,14 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
>>>  	memset(&cpa, 0, sizeof(cpa));
>>>  	cpa.vaddr = &addr;
>>>  	cpa.numpages = numpages;
>>> -	cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
>>> -	cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
>>> +	if (is_tdx_guest()) {
>>> +		cpa.mask_set = __pgprot(enc ? 0 : tdx_shared_mask());
>>> +		cpa.mask_clr = __pgprot(enc ? tdx_shared_mask() : 0);
>>> +	} else {
>>> +		cpa.mask_set = __pgprot(enc ? _PAGE_ENC : 0);
>>> +		cpa.mask_clr = __pgprot(enc ? 0 : _PAGE_ENC);
>>> +	}
>>
>> OK, this is too hideous to live.  It sucks that the TDX and SEV/SME bits
>> are opposite polarity, but oh well.
>>
>> To me, this gets a lot clearer, and opens up room for commenting if you
>> do something like:
>>
>> 	if (is_tdx_guest()) {
>> 		mem_enc_bits   = 0;
>> 		mem_plain_bits = tdx_shared_mask();
>> 	} else {
>> 		mem_enc_bits   = _PAGE_ENC;
>> 		mem_plain_bits = 0;
>> 	}
>>
>> 	if (enc) {
>> 		cpa.mask_set = mem_enc_bits;
>> 		cpa.mask_clr = mem_plain_bits;  // clear "plain" bits
>> 	} else {
>> 		
>> 		cpa.mask_set = mem_plain_bits;
>> 		cpa.mask_clr = mem_enc_bits;	// clear encryption bits
>> 	}
> 
> I'm not convinced that your approach is clearer. If you add the missing
> __pgprot() it's going to be as ugly as the original.
> 
> But if a maintainer wants... :)

Yes, please.  I think my version (with the added __pgprot() conversions)
clearly separates out the two things that are going on.

>>>  	cpa.pgd = init_mm.pgd;
>>>  
>>>  	/* Must avoid aliasing mappings in the highmem code */
>>> @@ -1999,7 +2006,8 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
>>>  	/*
>>>  	 * Before changing the encryption attribute, we need to flush caches.
>>>  	 */
>>> -	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
>>> +	if (!enc || !is_tdx_guest())
>>> +		cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
>>
>> That "!enc" looks wrong to me.  Caches would need to be flushed whenever
>> encryption attributes *change*, not just when they are set.
>>
>> Also, cpa_flush() flushes caches *AND* the TLB.  How does TDX manage to
>> not need TLB flushes?
> 
> I will double-check everything, but I think we can skip *both* cpa_flush()
> calls for private->shared conversion. The VMM and TDX module will take care
> of TLB and cache flushing in response to the MapGPA TDVMCALL.

Oh, interesting.  You might also want to double check if there are any
more places where X86_FEATURE_SME_COHERENT and TDX have similar properties.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 25/26] x86/tdx: Make DMA pages shared
  2021-04-06 16:38       ` Dave Hansen
@ 2021-04-06 17:16         ` Sean Christopherson
  0 siblings, 0 replies; 161+ messages in thread
From: Sean Christopherson @ 2021-04-06 17:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel, Kai Huang, Sean Christopherson

On Tue, Apr 06, 2021, Dave Hansen wrote:
> On 4/6/21 9:31 AM, Kirill A. Shutemov wrote:
> > On Thu, Apr 01, 2021 at 02:01:15PM -0700, Dave Hansen wrote:
> >>> @@ -1999,7 +2006,8 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> >>>  	/*
> >>>  	 * Before changing the encryption attribute, we need to flush caches.
> >>>  	 */
> >>> -	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
> >>> +	if (!enc || !is_tdx_guest())
> >>> +		cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
> >>
> >> That "!enc" looks wrong to me.  Caches would need to be flushed whenever
> >> encryption attributes *change*, not just when they are set.
> >>
> >> Also, cpa_flush() flushes caches *AND* the TLB.  How does TDX manage to
> >> not need TLB flushes?
> > 
> > I will double-check everything, but I think we can skip *both* cpa_flush()
> > calls for private->shared conversion. The VMM and TDX module will take care
> > of TLB and cache flushing in response to the MapGPA TDVMCALL.

No, on both accounts.

The guest is always responsible for flushing so called "linear mappings", i.e.
the gva -> gpa translations.  The VMM / TDX Module are responsible for flushing
the "guest-physical mappings" and "combined mappings" when the shared EPT /
secure EPT tables are modified.  E.g. the VMM could choose to keep separate
memory pools for shared vs. private and not even touch EPT tables on conversion.
But, the guest would still need to invalidate its virt->phys translations so
that accesses from within the guest generate the correct gpa.

Regarding cache flushing, the guest is responsible for flushing the cache lines
when converting from private to shared, and the VMM is responsible for flushing
the cache lines when converting from shared to private.

For private->shared, the VMM _can't_ do a targeted flush, as it can't generate
the correct physical address since stuffing a private key into its page tables
will #PF.  The VMM could do a full WBINVD, but that's not the intended ABI.
Hopefully this is documented in the GHCI...

For shared->private, the VMM is responsible for flushing the caches, assuming it
reuses the same physical page.  The TDX module does not enforce this directly,
rather TDX relies on integrity checks to detect if stale data (with the shared
key) is written back to guest private memory.  I.e. if the VMM does not do the
necessary flushing, the guest will get a poisoned memory #MC and die (or crash
the host).

> Oh, interesting.  You might also want to double check if there are any
> more places where X86_FEATURE_SME_COHERENT and TDX have similar properties.



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-03-30 16:57                                               ` Sean Christopherson
@ 2021-04-07 15:24                                                 ` Andi Kleen
  0 siblings, 0 replies; 161+ messages in thread
From: Andi Kleen @ 2021-04-07 15:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andy Lutomirski, Andy Lutomirski, Kuppuswamy, Sathyanarayanan,
	Peter Zijlstra, Dave Hansen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok, LKML

> Hmm, I forgot about that quirk.  I would expect the TDX Module to inject a #GP
> for that case.  I can't find anything in the spec that confirms or denies that,
> but injecting #VE would be weird and pointless.
> 
> Andi/Sathya, the TDX Module spec should be updated to state that XSETBV will
> #GP at CPL!=0.  If that's not already the behavior, the module should probably
> be changed...

I asked about this and the answer was that XSETBV behaves architecturally inside
a TD (no #VE), thus there is nothing to document.

-Andi

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC v1 00/26] Add TDX Guest Support
  2021-04-04 15:02 ` Dave Hansen
@ 2021-04-12 17:24   ` Dan Williams
  0 siblings, 0 replies; 161+ messages in thread
From: Dan Williams @ 2021-04-12 17:24 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Sun, Apr 4, 2021 at 8:02 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> It occurred to me that I've been doing a lot of digging in the TDX spec
> lately.  I think we can all agree that the "Architecture Specification"
> is not the world's easiest, most disgestable reading.  It's hard to
> figure out what the Linux relation to the spec is.
>
> One bit of Documentation we need for TDX is a description of the memory
> states.  For instance, it would be nice to spell out the different
> classes of memory, how they are selected, who selects them, and who
> enforces the selection.  What faults are generated on each type and who
> can induce those?
>
> For instance:
>
> TD-Private memory is selected by the Shared/Private bit in Present=1
> guest PTEs.  When the hardware page walker sees that bit, it walk the
> secure EPT.  The secure EPT entries can only be written by the TDX
> module, although they are written at the request of the VMM.  The TDX
> module enforces rules like ensuring that the memory mapped by secure EPT
> is not mapped multiple times.  The VMM can remove entries.  From the
> guest perspective, all private memory accesses are either successful, or
> result in a #VE.  Private memory access does not cause VMExits.
>
> Would that be useful to folks?

That paragraph was useful for me as someone coming in cold to TDX
patch review. +1 for more of that style of commentary.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-03-26 23:38                         ` [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() " Kuppuswamy Sathyanarayanan
@ 2021-04-20 17:36                           ` Dave Hansen
  2021-04-20 19:20                             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-20 17:36 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 3/26/21 4:38 PM, Kuppuswamy Sathyanarayanan wrote:
> Implement common helper functions to communicate with
> the TDX Module and VMM (using TDCALL instruction).

This is missing any kind of background.  I'd say:

Guests communicate with VMMs with hypercalls. Historically, these are
implemented using instructions that are known to cause VMEXITs like
<examples here>.  However, with TDX, VMEXITs no longer expose guest
state from the host.  This prevents the old hypercall mechanisms from
working....

... and then go on to talk about what you are introducing, why there are
two of them and so forth.

> __tdvmcall() function can be used to request services
> from VMM.

	^ "from a VMM" or "from the VMM", please

> __tdcall() function can be used to communicate with the
> TDX Module.
> 
> Using common helper functions makes the code more readable
> and less error prone compared to distributed and use case
> specific inline assembly code. Only downside in using this

				 ^ "The only downside..."

> approach is, it adds a few extra instructions for every
> TDCALL use case when compared to distributed checks. Although
> it's a bit less efficient, it's worth it to make the code more
> readable.

What's a "distributed check"?

This also doesn't talk at all about why this approach was chosen versus
inline assembly.  You're going to be asked "why not use inline asm?"

> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -8,12 +8,35 @@
>  #ifdef CONFIG_INTEL_TDX_GUEST
>  
>  #include <asm/cpufeature.h>
> +#include <linux/types.h>
> +
> +struct tdcall_output {
> +	u64 rcx;
> +	u64 rdx;
> +	u64 r8;
> +	u64 r9;
> +	u64 r10;
> +	u64 r11;
> +};
> +
> +struct tdvmcall_output {
> +	u64 r11;
> +	u64 r12;
> +	u64 r13;
> +	u64 r14;
> +	u64 r15;
> +};
>  
>  /* Common API to check TDX support in decompression and common kernel code. */
>  bool is_tdx_guest(void);
>  
>  void __init tdx_early_init(void);
>  
> +u64 __tdcall(u64 fn, u64 rcx, u64 rdx, struct tdcall_output *out);
> +
> +u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
> +	       struct tdvmcall_output *out);

Some one-liner comments about what these do would be nice.

>  #else // !CONFIG_INTEL_TDX_GUEST
>  
>  static inline bool is_tdx_guest(void)
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index ea111bf50691..7966c10ea8d1 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
>  obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
>  
>  obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
> -obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
> +obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
>  
>  obj-$(CONFIG_EISA)		+= eisa.o
>  obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> index 60b9f42ce3c1..72de0b49467e 100644
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -23,6 +23,10 @@
>  #include <xen/interface/xen.h>
>  #endif
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +#include <asm/tdx.h>
> +#endif
> +
>  #ifdef CONFIG_X86_32
>  # include "asm-offsets_32.c"
>  #else
> @@ -75,6 +79,24 @@ static void __used common(void)
>  	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
>  #endif
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +	BLANK();
> +	/* Offset for fields in tdcall_output */
> +	OFFSET(TDCALL_rcx, tdcall_output, rcx);
> +	OFFSET(TDCALL_rdx, tdcall_output, rdx);
> +	OFFSET(TDCALL_r8, tdcall_output, r8);
> +	OFFSET(TDCALL_r9, tdcall_output, r9);
> +	OFFSET(TDCALL_r10, tdcall_output, r10);
> +	OFFSET(TDCALL_r11, tdcall_output, r11);

			 ^ vertically align this

> +	/* Offset for fields in tdvmcall_output */
> +	OFFSET(TDVMCALL_r11, tdvmcall_output, r11);
> +	OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
> +	OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
> +	OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
> +	OFFSET(TDVMCALL_r15, tdvmcall_output, r15);
> +#endif
> +
>  	BLANK();
>  	OFFSET(BP_scratch, boot_params, scratch);
>  	OFFSET(BP_secure_boot, boot_params, secure_boot);
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> new file mode 100644
> index 000000000000..a73b67c0b407
> --- /dev/null
> +++ b/arch/x86/kernel/tdcall.S
> @@ -0,0 +1,163 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <asm/asm-offsets.h>
> +#include <asm/asm.h>
> +#include <asm/frame.h>
> +#include <asm/unwind_hints.h>
> +
> +#include <linux/linkage.h>
> +
> +#define TDVMCALL_EXPOSE_REGS_MASK	0xfc00

This looks like an undocumented magic number.

> +/*
> + * TDCALL instruction is newly added in TDX architecture,
> + * used by TD for requesting the host VMM to provide
> + * (untrusted) services. Supported in Binutils >= 2.36
> + */

Host VMM *AND* TD-module, right?

> +#define tdcall .byte 0x66,0x0f,0x01,0xcc

How well will the "newly added" comment age?

"host VMM" is redundant.

/*
 * TDX guests use the TDCALL instruction to make
 * hypercalls to the VMM.  ...


> +/* Only for non TDVMCALL use cases */
> +SYM_FUNC_START(__tdcall)
> +	FRAME_BEGIN
> +
> +	/* Save/restore non-volatile GPRs that are exposed to the VMM. */
> +	push %r15
> +	push %r14
> +	push %r13
> +	push %r12

How is this restoring GPRs?

> +	/*
> +	 * RDI  => RAX = TDCALL leaf
> +	 * RSI  => RCX = input param 1
> +	 * RDX  => RDX = input param 2
> +	 * RCX  => N/A = output struct
> +	 */

I don't like this block comment.  These should be interspersed with the
instructions.  It's actually redundant with what's below.

> +	/* Save output pointer to R12 */
> +	mov %rcx, %r12

Is this a "save" or a "move"?  Isn't this moving the function argument
"%rcx" to the TDCALL register argument "%r12"?

> +	/* Move TDCALL Leaf ID to RAX */
> +	mov %rdi, %rax
> +	/* Move input param 1 to rcx*/
> +	mov %rsi, %rcx

This needs a comment:

	/* Leave the third function argument (%RDX) in place */

> +	tdcall
> +
> +	/*
> +	 * On success, propagate TDCALL outputs values to the output struct,
> +	 * if an output struct is provided.
> +	 */

Again, I don't like the comment separated from the instructions.  This
should be:

	
	/* Check for TDCALL success: */
> +	test %rax, %rax
> +	jnz 1f

	/* Check for a TDCALL output struct */
> +	test %r12, %r12
> +	jz 1f

	/* Copy TDCALL result registers to output struct: */
> +	movq %rcx, TDCALL_rcx(%r12)
> +	movq %rdx, TDCALL_rdx(%r12)
> +	movq %r8, TDCALL_r8(%r12)
> +	movq %r9, TDCALL_r9(%r12)
> +	movq %r10, TDCALL_r10(%r12)
> +	movq %r11, TDCALL_r11(%r12)

		 ^ Vertically align this

> +1:
> +	/*
> +	 * Zero out registers exposed to the VMM to avoid speculative execution
> +	 * with VMM-controlled values.
> +	 */
> +        xor %rcx, %rcx
> +        xor %rdx, %rdx
> +        xor %r8d, %r8d
> +        xor %r9d, %r9d
> +        xor %r10d, %r10d
> +        xor %r11d, %r11d

This has tabs-versus-space problems.

Also, is this the architectural list of *POSSIBLE* registers to which
the VMM can write?

> +	pop %r12
> +	pop %r13
> +	pop %r14
> +	pop %r15
> +
> +	FRAME_END
> +	ret
> +SYM_FUNC_END(__tdcall)
> +
> +.macro tdvmcall_core
> +	FRAME_BEGIN
> +
> +	/* Save/restore non-volatile GPRs that are exposed to the VMM. */

Again, where's the "restore"?

> +	push %r15
> +	push %r14
> +	push %r13
> +	push %r12
> +
> +	/*
> +	 * 0    => RAX = TDCALL leaf
> +	 * RDI  => R11 = TDVMCALL function, e.g. exit reason
> +	 * RSI  => R12 = input param 0
> +	 * RDX  => R13 = input param 1
> +	 * RCX  => R14 = input param 2
> +	 * R8   => R15 = input param 3
> +	 * MASK => RCX = TDVMCALL register behavior
> +	 * R9   => R9  = output struct
> +	 */
> +
> +	xor %eax, %eax
> +	mov %rdi, %r11
> +	mov %rsi, %r12
> +	mov %rdx, %r13
> +	mov %rcx, %r14
> +	mov %r8,  %r15
> +
> +	/*
> +	 * Expose R10 - R15, i.e. all GPRs that may be used by TDVMCALLs
> +	 * defined in the GHCI.  Note, RAX and RCX are consumed, but only by
> +	 * TDX-Module and so don't need to be listed in the mask.
> +	 */

"GCHI" is out of the blue here.  So is "TDX-Module".  There needs to be
more context.

> +	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
> +
> +	tdcall
> +
> +	/* Panic if TDCALL reports failure. */
> +	test %rax, %rax
> +	jnz 2f

Why panic?

Also, do you *REALLY* need to do this from assembly?  Can't it be done
in the C wrapper?

> +	/* Propagate TDVMCALL success/failure to return value. */
> +	mov %r10, %rax

You just said it panic's on failure.  How can this propagate failure?

> +	/*
> +	 * On success, propagate TDVMCALL outputs values to the output struct,
> +	 * if an output struct is provided.
> +	 */
> +	test %rax, %rax
> +	jnz 1f
> +	test %r9, %r9
> +	jz 1f
> +
> +	movq %r11, TDVMCALL_r11(%r9)
> +	movq %r12, TDVMCALL_r12(%r9)
> +	movq %r13, TDVMCALL_r13(%r9)
> +	movq %r14, TDVMCALL_r14(%r9)
> +	movq %r15, TDVMCALL_r15(%r9)
> +1:
> +	/*
> +	 * Zero out registers exposed to the VMM to avoid speculative execution
> +	 * with VMM-controlled values.
> +	 */

Please evenly split the comment across those two lines.  (Do this
everywhere in the series).

> +	xor %r10d, %r10d
> +	xor %r11d, %r11d
> +	xor %r12d, %r12d
> +	xor %r13d, %r13d
> +	xor %r14d, %r14d
> +	xor %r15d, %r15d
> +
> +	pop %r12
> +	pop %r13
> +	pop %r14
> +	pop %r15
> +
> +	FRAME_END
> +	ret
> +2:
> +	ud2
> +.endm
> +
> +SYM_FUNC_START(__tdvmcall)
> +	xor %r10, %r10
> +	tdvmcall_core
> +SYM_FUNC_END(__tdvmcall)
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 0d00dd50a6ff..1147e7e765d6 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -3,6 +3,36 @@
>  
>  #include <asm/tdx.h>
>  
> +/*
> + * Wrapper for the common case with standard output value (R10).
> + */

... and oddly enough there is no explicit mention of R10 anywhere.  Why?

> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
> +{
> +	u64 err;
> +
> +	err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
> +
> +	WARN_ON(err);
> +
> +	return err;
> +}

Are there really *ZERO* reasons for a TDVMCALL to return an error?
Won't this let a malicious VMM spew endless warnings into the guest console?

> +/*
> + * Wrapper for the semi-common case where we need single output value (R11).
> + */
> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
> +{
> +
> +	struct tdvmcall_output out = {0};
> +	u64 err;
> +
> +	err = __tdvmcall(fn, r12, r13, r14, r15, &out);
> +
> +	WARN_ON(err);
> +
> +	return out.r11;
> +}
> +

But you introduced __tdvmcall and __tdcall assembly functions.  Why
aren't both of them used?

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-20 17:36                           ` Dave Hansen
@ 2021-04-20 19:20                             ` Kuppuswamy, Sathyanarayanan
  2021-04-20 19:59                               ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-20 19:20 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel



On 4/20/21 10:36 AM, Dave Hansen wrote:
> On 3/26/21 4:38 PM, Kuppuswamy Sathyanarayanan wrote:
>> Implement common helper functions to communicate with
>> the TDX Module and VMM (using TDCALL instruction).
> 
> This is missing any kind of background.  I'd say:
> 
> Guests communicate with VMMs with hypercalls. Historically, these are
> implemented using instructions that are known to cause VMEXITs like
> <examples here>.  However, with TDX, VMEXITs no longer expose guest
> state from the host.  This prevents the old hypercall mechanisms from
> working....
> 
> ... and then go on to talk about what you are introducing, why there are
> two of them and so forth.
Ok. I will add it.
> 
>> __tdvmcall() function can be used to request services
>> from VMM.
> 
> 	^ "from a VMM" or "from the VMM", please
> 
will use "from the VMM".
>> __tdcall() function can be used to communicate with the
>> TDX Module.
>>
>> Using common helper functions makes the code more readable
>> and less error prone compared to distributed and use case
>> specific inline assembly code. Only downside in using this
> 
> 				 ^ "The only downside..."
will fix it.
> 
>> approach is, it adds a few extra instructions for every
>> TDCALL use case when compared to distributed checks. Although
>> it's a bit less efficient, it's worth it to make the code more
>> readable.
> 
> What's a "distributed check"?

It should be "distributed TDVMCALL/TDCALL inline assembly calls"
> 
> This also doesn't talk at all about why this approach was chosen versus
> inline assembly.  You're going to be asked "why not use inline asm?"
"To make the core more readable and less error prone." I have added this info
in above paragraph. Do you think we need more argument to justify our approach?
> 
>> --- a/arch/x86/include/asm/tdx.h
>> +++ b/arch/x86/include/asm/tdx.h
>> @@ -8,12 +8,35 @@
>>   #ifdef CONFIG_INTEL_TDX_GUEST
>>   
>>   #include <asm/cpufeature.h>
>> +#include <linux/types.h>
>> +
>> +struct tdcall_output {
>> +	u64 rcx;
>> +	u64 rdx;
>> +	u64 r8;
>> +	u64 r9;
>> +	u64 r10;
>> +	u64 r11;
>> +};
>> +
>> +struct tdvmcall_output {
>> +	u64 r11;
>> +	u64 r12;
>> +	u64 r13;
>> +	u64 r14;
>> +	u64 r15;
>> +};
>>   
>>   /* Common API to check TDX support in decompression and common kernel code. */
>>   bool is_tdx_guest(void);
>>   
>>   void __init tdx_early_init(void);
>>   
>> +u64 __tdcall(u64 fn, u64 rcx, u64 rdx, struct tdcall_output *out);
>> +
>> +u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
>> +	       struct tdvmcall_output *out);
> 
> Some one-liner comments about what these do would be nice.
will do.
> 
>>   #else // !CONFIG_INTEL_TDX_GUEST
>>   
>>   static inline bool is_tdx_guest(void)
>> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
>> index ea111bf50691..7966c10ea8d1 100644
>> --- a/arch/x86/kernel/Makefile
>> +++ b/arch/x86/kernel/Makefile
>> @@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
>>   obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
>>   
>>   obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
>> -obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
>> +obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
>>   
>>   obj-$(CONFIG_EISA)		+= eisa.o
>>   obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
>> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
>> index 60b9f42ce3c1..72de0b49467e 100644
>> --- a/arch/x86/kernel/asm-offsets.c
>> +++ b/arch/x86/kernel/asm-offsets.c
>> @@ -23,6 +23,10 @@
>>   #include <xen/interface/xen.h>
>>   #endif
>>   
>> +#ifdef CONFIG_INTEL_TDX_GUEST
>> +#include <asm/tdx.h>
>> +#endif
>> +
>>   #ifdef CONFIG_X86_32
>>   # include "asm-offsets_32.c"
>>   #else
>> @@ -75,6 +79,24 @@ static void __used common(void)
>>   	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
>>   #endif
>>   
>> +#ifdef CONFIG_INTEL_TDX_GUEST
>> +	BLANK();
>> +	/* Offset for fields in tdcall_output */
>> +	OFFSET(TDCALL_rcx, tdcall_output, rcx);
>> +	OFFSET(TDCALL_rdx, tdcall_output, rdx);
>> +	OFFSET(TDCALL_r8, tdcall_output, r8);
>> +	OFFSET(TDCALL_r9, tdcall_output, r9);
>> +	OFFSET(TDCALL_r10, tdcall_output, r10);
>> +	OFFSET(TDCALL_r11, tdcall_output, r11);
> 
> 			 ^ vertically align this
> 
will fix it.
>> +	/* Offset for fields in tdvmcall_output */
>> +	OFFSET(TDVMCALL_r11, tdvmcall_output, r11);
>> +	OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
>> +	OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
>> +	OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
>> +	OFFSET(TDVMCALL_r15, tdvmcall_output, r15);
>> +#endif
>> +
>>   	BLANK();
>>   	OFFSET(BP_scratch, boot_params, scratch);
>>   	OFFSET(BP_secure_boot, boot_params, secure_boot);
>> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
>> new file mode 100644
>> index 000000000000..a73b67c0b407
>> --- /dev/null
>> +++ b/arch/x86/kernel/tdcall.S
>> @@ -0,0 +1,163 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#include <asm/asm-offsets.h>
>> +#include <asm/asm.h>
>> +#include <asm/frame.h>
>> +#include <asm/unwind_hints.h>
>> +
>> +#include <linux/linkage.h>
>> +
>> +#define TDVMCALL_EXPOSE_REGS_MASK	0xfc00
> 
> This looks like an undocumented magic number.
> 
>> +/*
>> + * TDCALL instruction is newly added in TDX architecture,
>> + * used by TD for requesting the host VMM to provide
>> + * (untrusted) services. Supported in Binutils >= 2.36
>> + */
> 
> Host VMM *AND* TD-module, right?
Yes, you are correct. I will fix it.
> 
>> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
> 
> How well will the "newly added" comment age?
> 
> "host VMM" is redundant.
> 
> /*
>   * TDX guests use the TDCALL instruction to make
>   * hypercalls to the VMM.  ...
will use it.
> 
> 
>> +/* Only for non TDVMCALL use cases */
>> +SYM_FUNC_START(__tdcall)
>> +	FRAME_BEGIN
>> +
>> +	/* Save/restore non-volatile GPRs that are exposed to the VMM. */
>> +	push %r15
>> +	push %r14
>> +	push %r13
>> +	push %r12
> 
> How is this restoring GPRs?
I have used the same comment for both push/pop combinations. Will remove
"restore" from the above comment.
> 
>> +	/*
>> +	 * RDI  => RAX = TDCALL leaf
>> +	 * RSI  => RCX = input param 1
>> +	 * RDX  => RDX = input param 2
>> +	 * RCX  => N/A = output struct
>> +	 */
> 
> I don't like this block comment.  These should be interspersed with the
> instructions.  It's actually redundant with what's below.
I just want to show the register mapping details in one place (similar to C
function comments). But I am fine with instruction-specific comments.
I will fix it in the next version.
> 
>> +	/* Save output pointer to R12 */
>> +	mov %rcx, %r12
> 
> Is this a "save" or a "move"?  Isn't this moving the function argument
> "%rcx" to the TDCALL register argument "%r12"?
> 
>> +	/* Move TDCALL Leaf ID to RAX */
>> +	mov %rdi, %rax
>> +	/* Move input param 1 to rcx*/
>> +	mov %rsi, %rcx
> 
> This needs a comment:
> 
> 	/* Leave the third function argument (%RDX) in place */
> 
Ok.
>> +	tdcall
>> +
>> +	/*
>> +	 * On success, propagate TDCALL outputs values to the output struct,
>> +	 * if an output struct is provided.
>> +	 */
> 
> Again, I don't like the comment separated from the instructions.  This
> should be:
will use instruction specific comments.
> 
> 	
> 	/* Check for TDCALL success: */
>> +	test %rax, %rax
>> +	jnz 1f
> 
> 	/* Check for a TDCALL output struct */
>> +	test %r12, %r12
>> +	jz 1f
> 
> 	/* Copy TDCALL result registers to output struct: */
>> +	movq %rcx, TDCALL_rcx(%r12)
>> +	movq %rdx, TDCALL_rdx(%r12)
>> +	movq %r8, TDCALL_r8(%r12)
>> +	movq %r9, TDCALL_r9(%r12)
>> +	movq %r10, TDCALL_r10(%r12)
>> +	movq %r11, TDCALL_r11(%r12)
> 
> 		 ^ Vertically align this
will do.
> 
>> +1:
>> +	/*
>> +	 * Zero out registers exposed to the VMM to avoid speculative execution
>> +	 * with VMM-controlled values.
>> +	 */
>> +        xor %rcx, %rcx
>> +        xor %rdx, %rdx
>> +        xor %r8d, %r8d
>> +        xor %r9d, %r9d
>> +        xor %r10d, %r10d
>> +        xor %r11d, %r11d
> 
> This has tabs-versus-space problems.
> 
> Also, is this the architectural list of *POSSIBLE* registers to which
> the VMM can write?
> 
>> +	pop %r12
>> +	pop %r13
>> +	pop %r14
>> +	pop %r15
>> +
>> +	FRAME_END
>> +	ret
>> +SYM_FUNC_END(__tdcall)
>> +
>> +.macro tdvmcall_core
>> +	FRAME_BEGIN
>> +
>> +	/* Save/restore non-volatile GPRs that are exposed to the VMM. */
> 
> Again, where's the "restore"?
I have used the same comment for both push/pop combinations. Will remove
"restore" from the above comment.
> 
>> +	push %r15
>> +	push %r14
>> +	push %r13
>> +	push %r12
>> +
>> +	/*
>> +	 * 0    => RAX = TDCALL leaf
>> +	 * RDI  => R11 = TDVMCALL function, e.g. exit reason
>> +	 * RSI  => R12 = input param 0
>> +	 * RDX  => R13 = input param 1
>> +	 * RCX  => R14 = input param 2
>> +	 * R8   => R15 = input param 3
>> +	 * MASK => RCX = TDVMCALL register behavior
>> +	 * R9   => R9  = output struct
>> +	 */
>> +
>> +	xor %eax, %eax
>> +	mov %rdi, %r11
>> +	mov %rsi, %r12
>> +	mov %rdx, %r13
>> +	mov %rcx, %r14
>> +	mov %r8,  %r15
>> +
>> +	/*
>> +	 * Expose R10 - R15, i.e. all GPRs that may be used by TDVMCALLs
>> +	 * defined in the GHCI.  Note, RAX and RCX are consumed, but only by
>> +	 * TDX-Module and so don't need to be listed in the mask.
>> +	 */
> 
> "GCHI" is out of the blue here.  So is "TDX-Module".  There needs to be
> more context.
ok. will add it. Do you want GHCI spec reference with section id here?
> 
>> +	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>> +
>> +	tdcall
>> +
>> +	/* Panic if TDCALL reports failure. */
>> +	test %rax, %rax
>> +	jnz 2f
> 
> Why panic?
As per the spec, TDCALL should never fail. Note that it has nothing to do
with TDVMCALL function-specific failure (which is reported via R10).
> 
> Also, do you *REALLY* need to do this from assembly?  Can't it be done
> in the C wrapper?
It's common for all use cases of TDVMCALL (vendor specific, in/out, etc.), so
I added it here.
> 
>> +	/* Propagate TDVMCALL success/failure to return value. */
>> +	mov %r10, %rax
> 
> You just said it panic's on failure.  How can this propagate failure?
we use panic for TDCALL failure. But, the R10 content is used to identify
whether a given TDVMCALL function operation is successful or not.
> 
>> +	/*
>> +	 * On success, propagate TDVMCALL outputs values to the output struct,
>> +	 * if an output struct is provided.
>> +	 */
>> +	test %rax, %rax
>> +	jnz 1f
>> +	test %r9, %r9
>> +	jz 1f
>> +
>> +	movq %r11, TDVMCALL_r11(%r9)
>> +	movq %r12, TDVMCALL_r12(%r9)
>> +	movq %r13, TDVMCALL_r13(%r9)
>> +	movq %r14, TDVMCALL_r14(%r9)
>> +	movq %r15, TDVMCALL_r15(%r9)
>> +1:
>> +	/*
>> +	 * Zero out registers exposed to the VMM to avoid speculative execution
>> +	 * with VMM-controlled values.
>> +	 */
> 
> Please evenly split the comment across those two lines.  (Do this
> everywhere in the series).
ok.
> 
>> +	xor %r10d, %r10d
>> +	xor %r11d, %r11d
>> +	xor %r12d, %r12d
>> +	xor %r13d, %r13d
>> +	xor %r14d, %r14d
>> +	xor %r15d, %r15d
>> +
>> +	pop %r12
>> +	pop %r13
>> +	pop %r14
>> +	pop %r15
>> +
>> +	FRAME_END
>> +	ret
>> +2:
>> +	ud2
>> +.endm
>> +
>> +SYM_FUNC_START(__tdvmcall)
>> +	xor %r10, %r10
>> +	tdvmcall_core
>> +SYM_FUNC_END(__tdvmcall)
>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>> index 0d00dd50a6ff..1147e7e765d6 100644
>> --- a/arch/x86/kernel/tdx.c
>> +++ b/arch/x86/kernel/tdx.c
>> @@ -3,6 +3,36 @@
>>   
>>   #include <asm/tdx.h>
>>   
>> +/*
>> + * Wrapper for the common case with standard output value (R10).
>> + */
> 
> ... and oddly enough there is no explicit mention of R10 anywhere.  Why?
For a guest-to-host call, R10 holds the TDCALL function id (which is 0 for
TDVMCALL), so we don't need a special argument.
After TDVMCALL execution, the R10 value is returned via RAX.
> 
>> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>> +{
>> +	u64 err;
>> +
>> +	err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
>> +
>> +	WARN_ON(err);
>> +
>> +	return err;
>> +}
> 
> Are there really *ZERO* reasons for a TDVMCALL to return an error?
No. It's useful for debugging TDVMCALL failures.
> Won't this let a malicious VMM spew endless warnings into the guest console?
As per the GHCI spec, R10 will hold error code details which can be used to
determine the type of TDVMCALL failure. More warnings at least show that we
are working with a malicious VMM.
> 
>> +/*
>> + * Wrapper for the semi-common case where we need single output value (R11).
>> + */
>> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>> +{
>> +
>> +	struct tdvmcall_output out = {0};
>> +	u64 err;
>> +
>> +	err = __tdvmcall(fn, r12, r13, r14, r15, &out);
>> +
>> +	WARN_ON(err);
>> +
>> +	return out.r11;
>> +}
>> +
> 
> But you introduced __tdvmcall and __tdcall assembly functions.  Why
> aren't both of them used?
This patch just adds helper functions. They are used by other patches in the
series. __tdvmcall is used in this patch because we need to add more
wrappers for it.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-20 19:20                             ` Kuppuswamy, Sathyanarayanan
@ 2021-04-20 19:59                               ` Dave Hansen
  2021-04-20 23:12                                 ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-20 19:59 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 4/20/21 12:20 PM, Kuppuswamy, Sathyanarayanan wrote:
>>> approach is, it adds a few extra instructions for every
>>> TDCALL use case when compared to distributed checks. Although
>>> it's a bit less efficient, it's worth it to make the code more
>>> readable.
>>
>> What's a "distributed check"?
> 
> It should be "distributed TDVMCALL/TDCALL inline assembly calls"

It's still not clear to what that refers.

>> This also doesn't talk at all about why this approach was chosen versus
>> inline assembly.  You're going to be asked "why not use inline asm?"
> "To make the core more readable and less error prone." I have added this
> info in above paragraph. Do you think we need more argument to
> justify our approach?

Yes, you need much more justification.  That's pretty generic and
non-specific.

>>> +    /*
>>> +     * Expose R10 - R15, i.e. all GPRs that may be used by TDVMCALLs
>>> +     * defined in the GHCI.  Note, RAX and RCX are consumed, but
>>> only by
>>> +     * TDX-Module and so don't need to be listed in the mask.
>>> +     */
>>
>> "GCHI" is out of the blue here.  So is "TDX-Module".  There needs to be
>> more context.
> ok. will add it. Do you want GHCI spec reference with section id here?

Absolutely not.  I dislike all of the section references as-is.  Doesn't
a comment like this say what you said above and also add context?

	Expose every register currently used in the
	guest-to-host communication interface (GHCI).

>>> +    movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>>> +
>>> +    tdcall
>>> +
>>> +    /* Panic if TDCALL reports failure. */
>>> +    test %rax, %rax
>>> +    jnz 2f
>>
>> Why panic?
> As per spec, TDCALL should never fail. Note that it has nothing to do
> with TDVMCALL function specific failure (which is reported via R10).

You've introduced two concepts here, without differentiating them.  You
need to work to differentiate these two kinds of failure somewhere.  You
can't simply refer to both as "failure".

>> Also, do you *REALLY* need to do this from assembly?  Can't it be done
>> in the C wrapper?
> Its common for all use cases of TDVMCALL (vendor specific, in/out, etc).
> so added
> it here.

That's not a good reason.  You could just as easily have a C wrapper
which all uses of TDVMCALL go through.

>>> +    /* Propagate TDVMCALL success/failure to return value. */
>>> +    mov %r10, %rax
>>
>> You just said it panic's on failure.  How can this propagate failure?
> we use panic for TDCALL failure. But, R10 content used to identify
> whether given
> TDVMCALL function operation is successful or not.

As I said above, please endeavor to differentiate the two classes of
failures.

Also, if the spec is violated, do you *REALLY* want to blow up the whole
world with a panic?  I guess you can argue that it could have been
something security-related that failed.  But, either way, you owe a
description of why panic'ing is a good idea, not just blindly deferring
to "the spec says this can't happen".

>>> +    xor %r10d, %r10d
>>> +    xor %r11d, %r11d
>>> +    xor %r12d, %r12d
>>> +    xor %r13d, %r13d
>>> +    xor %r14d, %r14d
>>> +    xor %r15d, %r15d
>>> +
>>> +    pop %r12
>>> +    pop %r13
>>> +    pop %r14
>>> +    pop %r15
>>> +
>>> +    FRAME_END
>>> +    ret
>>> +2:
>>> +    ud2
>>> +.endm
>>> +
>>> +SYM_FUNC_START(__tdvmcall)
>>> +    xor %r10, %r10
>>> +    tdvmcall_core
>>> +SYM_FUNC_END(__tdvmcall)
>>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>>> index 0d00dd50a6ff..1147e7e765d6 100644
>>> --- a/arch/x86/kernel/tdx.c
>>> +++ b/arch/x86/kernel/tdx.c
>>> @@ -3,6 +3,36 @@
>>>     #include <asm/tdx.h>
>>>   +/*
>>> + * Wrapper for the common case with standard output value (R10).
>>> + */
>>
>> ... and oddly enough there is no explicit mention of R10 anywhere.  Why?
> For Guest to Host call -> R10 holds TDCALL function id (which is 0 for
> TDVMCALL). so
> we don't need special argument.
> After TDVMCALL execution, R10 value is returned via RAX.

OK... so this is how it works.  But why mention R10 in the comment?  Why
is *THAT* worth mentioning?

>>> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>>> +{
>>> +    u64 err;
>>> +
>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
>>> +
>>> +    WARN_ON(err);
>>> +
>>> +    return err;
>>> +}
>>
>> Are there really *ZERO* reasons for a TDVMCALL to return an error?
> No. Its useful for debugging TDVMCALL failures.
>> Won't this let a malicious VMM spew endless warnings into the guest
>> console?
> As per GHCI spec, R10 will hold error code details which can be used to
> determine
> the type of TDVMCALL failure.

I would encourage you to stop citing the GHCI spec.  In all of these
conversations, you seem to rely on it without considering the underlying
reasons.  The fact that R10 is the error code is 100% irrelevant for
this conversation.

It's also entirely possible that the host would have bugs, or forget to
clear a bit somewhere, even if the spec says, "don't do it".

> More warnings at-least show that we are working
> with malicious VMM.

That argument does not hold water for me.

You can argue that a counter can be kept, or that a WARN_ON_ONCE() is
appropriate, or that a printk_ratelimited() is nice.  But, allowing an
untrusted software component to write unlimited warnings to the kernel
console is utterly nonsensical.

By the same argument, any userspace exploit attempts could spew warnings
to the console also so that we can tell we are working with malicious
userspace.

>>> +/*
>>> + * Wrapper for the semi-common case where we need single output
>>> value (R11).
>>> + */
>>> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64
>>> r14, u64 r15)
>>> +{
>>> +
>>> +    struct tdvmcall_output out = {0};
>>> +    u64 err;
>>> +
>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, &out);
>>> +
>>> +    WARN_ON(err);
>>> +
>>> +    return out.r11;
>>> +}
>>> +
>>
>> But you introduced __tdvmcall and __tdcall assembly functions.  Why
>> aren't both of them used?
> This patch just adds helper functions. Its used by other patches in the
> series. __tdvmcall is used in this patch because we need to add more
> wrappers for it.

That needs to be mentioned in the changelog at least.


* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-20 19:59                               ` Dave Hansen
@ 2021-04-20 23:12                                 ` Kuppuswamy, Sathyanarayanan
  2021-04-20 23:42                                   ` Dave Hansen
  2021-04-20 23:53                                   ` Dan Williams
  0 siblings, 2 replies; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-20 23:12 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel



On 4/20/21 12:59 PM, Dave Hansen wrote:
> On 4/20/21 12:20 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>> approach is, it adds a few extra instructions for every
>>>> TDCALL use case when compared to distributed checks. Although
>>>> it's a bit less efficient, it's worth it to make the code more
>>>> readable.
>>>
>>> What's a "distributed check"?
>>
>> It should be "distributed TDVMCALL/TDCALL inline assembly calls"
> 
> It's still not clear to what that refers.

I am just comparing the performance cost of a generic TDCALL()/TDVMCALL()
function implementation with "usage-specific" (GetQuote, MapGPA, HLT, etc.)
custom TDCALL()/TDVMCALL() inline assembly implementations.
> 
>>> This also doesn't talk at all about why this approach was chosen versus
>>> inline assembly.  You're going to be asked "why not use inline asm?"
>> "To make the core more readable and less error prone." I have added this
>> info in above paragraph. Do you think we need more argument to
>> justify our approach?
> 
> Yes, you need much more justification.  That's pretty generic and
> non-specific.
readability is one of the main motivation for not choosing inline
assembly. Since number of lines of instructions (with comments) are over
70, using inline assembly made it hard to read. Another reason is, since we
are using many registers (R8-R15, R[A-D]X)) in TDVMCAL/TDCALL operation, we are
not sure whether some older compiler can follow our specified inline assembly
constraints.
> 
>>>> +    /*
>>>> +     * Expose R10 - R15, i.e. all GPRs that may be used by TDVMCALLs
>>>> +     * defined in the GHCI.  Note, RAX and RCX are consumed, but
>>>> only by
>>>> +     * TDX-Module and so don't need to be listed in the mask.
>>>> +     */
>>>
>>> "GCHI" is out of the blue here.  So is "TDX-Module".  There needs to be
>>> more context.
>> ok. will add it. Do you want GHCI spec reference with section id here?
> 
> Absolutely not.  I dislike all of the section references as-is.  Doesn't
> a comment like this say what you said above and also add context?
> 
> 	Expose every register currently used in the
> 	guest-to-host communication interface (GHCI).
ok.
> 
>>>> +    movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>>>> +
>>>> +    tdcall
>>>> +
>>>> +    /* Panic if TDCALL reports failure. */
>>>> +    test %rax, %rax
>>>> +    jnz 2f
>>>
>>> Why panic?
>> As per spec, TDCALL should never fail. Note that it has nothing to do
>> with TDVMCALL function specific failure (which is reported via R10).
> 
> You've introduced two concepts here, without differentiating them.  You
> need to work to differentiate these two kinds of failure somewhere.  You
> can't simply refer to both as "failure".
will clarify it. I have assumed that once the user reads the spec, its easier
to understand.
> 
>>> Also, do you *REALLY* need to do this from assembly?  Can't it be done
>>> in the C wrapper?
>> Its common for all use cases of TDVMCALL (vendor specific, in/out, etc).
>> so added
>> it here.
> 
> That's not a good reason.  You could just as easily have a C wrapper
> which all uses of TDVMCALL go through.
Any reason for not preferring it in assembly code?
Also, using wrapper will add more complication for in/out instruction
substitution use case. please check the use case in following patch.
https://github.com/intel/tdx/commit/1b73f60aa5bb93554f3b15cd786a9b10b53c1543
> 
>>>> +    /* Propagate TDVMCALL success/failure to return value. */
>>>> +    mov %r10, %rax
>>>
>>> You just said it panic's on failure.  How can this propagate failure?
>> we use panic for TDCALL failure. But, R10 content used to identify
>> whether given
>> TDVMCALL function operation is successful or not.
> 
> As I said above, please endeavor to differentiate the two classes of
> failures.
> 
> Also, if the spec is violated, do you *REALLY* want to blow up the whole
> world with a panic?  I guess you can argue that it could have been
> something security-related that failed.  But, either way, you owe a
> description of why panic'ing is a good idea, not just blindly deferring
> to "the spec says this can't happen".
ok. will add some comments justifying our case.
> 
>>>> +    xor %r10d, %r10d
>>>> +    xor %r11d, %r11d
>>>> +    xor %r12d, %r12d
>>>> +    xor %r13d, %r13d
>>>> +    xor %r14d, %r14d
>>>> +    xor %r15d, %r15d
>>>> +
>>>> +    pop %r12
>>>> +    pop %r13
>>>> +    pop %r14
>>>> +    pop %r15
>>>> +
>>>> +    FRAME_END
>>>> +    ret
>>>> +2:
>>>> +    ud2
>>>> +.endm
>>>> +
>>>> +SYM_FUNC_START(__tdvmcall)
>>>> +    xor %r10, %r10
>>>> +    tdvmcall_core
>>>> +SYM_FUNC_END(__tdvmcall)
>>>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>>>> index 0d00dd50a6ff..1147e7e765d6 100644
>>>> --- a/arch/x86/kernel/tdx.c
>>>> +++ b/arch/x86/kernel/tdx.c
>>>> @@ -3,6 +3,36 @@
>>>>      #include <asm/tdx.h>
>>>>    +/*
>>>> + * Wrapper for the common case with standard output value (R10).
>>>> + */
>>>
>>> ... and oddly enough there is no explicit mention of R10 anywhere.  Why?
>> For Guest to Host call -> R10 holds TDCALL function id (which is 0 for
>> TDVMCALL). so
>> we don't need special argument.
>> After TDVMCALL execution, R10 value is returned via RAX.
> 
> OK... so this is how it works.  But why mention R10 in the comment?  Why
> is *THAT* worth mentioning?
its not needed. will remove it.
> 
>>>> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>>>> +{
>>>> +    u64 err;
>>>> +
>>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
>>>> +
>>>> +    WARN_ON(err);
>>>> +
>>>> +    return err;
>>>> +}
>>>
>>> Are there really *ZERO* reasons for a TDVMCALL to return an error?
>> No. Its useful for debugging TDVMCALL failures.
>>> Won't this let a malicious VMM spew endless warnings into the guest
>>> console?
>> As per GHCI spec, R10 will hold error code details which can be used to
>> determine
>> the type of TDVMCALL failure.
> 
> I would encourage you to stop citing the GCCI spec.  In all of these
> conversations, you seem to rely on it without considering the underlying
> reasons.  The fact that R10 is the error code is 100% irrelevant for
> this conversation.
> 
> It's also entirely possible that the host would have bugs, or forget to
> clear a bit somewhere, even if the spec says, "don't do it".
> 
>> More warnings at-least show that we are working
>> with malicious VMM.
> 
> That argument does not hold water for me.
> 
> You can argue that a counter can be kept, or that a WARN_ON_ONCE() is
> appropriate, or that a printk_ratelimited() is nice.  But, allowing an
> untrusted software component to write unlimited warnings to the kernel
> console is utterly nonsensical.
> 
> By the same argument, any userspace exploit attempts could spew warnings
> to the console also so that we can tell we are working with malicious
> userspace.
In our case, we will get WARN() output only if guest triggers TDCALL()/TDVMCALL()
right? So getting WARN() message for failure of guest triggered call is
justifiable right?
> 
>>>> +/*
>>>> + * Wrapper for the semi-common case where we need single output
>>>> value (R11).
>>>> + */
>>>> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64
>>>> r14, u64 r15)
>>>> +{
>>>> +
>>>> +    struct tdvmcall_output out = {0};
>>>> +    u64 err;
>>>> +
>>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, &out);
>>>> +
>>>> +    WARN_ON(err);
>>>> +
>>>> +    return out.r11;
>>>> +}
>>>> +
>>>
>>> But you introduced __tdvmcall and __tdcall assembly functions.  Why
>>> aren't both of them used?
>> This patch just adds helper functions. Its used by other patches in the
>> series. __tdvmcall is used in this patch because we need to add more
>> wrappers for it.
> 
> That needs to be mentioned in the changelog at least.
ok will do it.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-20 23:12                                 ` Kuppuswamy, Sathyanarayanan
@ 2021-04-20 23:42                                   ` Dave Hansen
  2021-04-23  1:09                                     ` Kuppuswamy, Sathyanarayanan
  2021-04-20 23:53                                   ` Dan Williams
  1 sibling, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-20 23:42 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 4/20/21 4:12 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 4/20/21 12:59 PM, Dave Hansen wrote:
>> On 4/20/21 12:20 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>>> approach is, it adds a few extra instructions for every
>>>>> TDCALL use case when compared to distributed checks. Although
>>>>> it's a bit less efficient, it's worth it to make the code more
>>>>> readable.
>>>>
>>>> What's a "distributed check"?
>>>
>>> It should be "distributed TDVMCALL/TDCALL inline assembly calls"
>>
>> It's still not clear to what that refers.
> 
> I am just comparing the performance cost of using generic 
> TDCALL()/TDVMCALL() function implementation with "usage specific"
> (GetQuote,MapGPA, HLT,etc) custom TDCALL()/TDVMCALL() inline assembly
> implementation.

So, I actually had an idea what you were talking about, but I have
*ZERO* idea what "distributed" means in this context.

I think you are trying to say something along the lines of:

	Just like syscalls, not all TDVMCALL/TDCALLs use the same set
	of argument registers.  The implementation here picks the
	current worst-case scenario for TDCALL (4 registers).  For
	TDCALLs with fewer than 4 arguments, there will end up being
	a few superfluous (cheap) instructions.  But, this approach
	maximizes code reuse.


>>>> This also doesn't talk at all about why this approach was
>>>> chosen versus inline assembly.  You're going to be asked "why
>>>> not use inline asm?"
>>> "To make the core more readable and less error prone." I have
>>> added this info in above paragraph. Do you think we need more
>>> argument to justify our approach?
>> 
>> Yes, you need much more justification.  That's pretty generic and 
>> non-specific.
> readability is one of the main motivation for not choosing inline 

I'm curious.  Is there a reason you are not choosing to use
capitalization in your replies?  I personally use capitalization as a
visual clue for where a reply starts.

I'm not sure whether this indicates that your keyboard is not
functioning properly, or that these replies are simply not important
enough to warrant the use of the Shift key.  Or, is it simply an
oversight?  Or, maybe I'm just being overly picky because I've been
working on these exact things with my third-grader a bit too much lately.

Either way, I personally would appreciate your attention to detail in
crafting writing that is easy to parse, since I'm the one that's going
to have to parse it.  Details show that you care about the content you
produce.  Even if you don't mean it, a lack of attention to detail (even
capital letters) can be perceived to mean that you do not care about
what you write.  If you don't care about it, why should the reader?

> assembly. Since number of lines of instructions (with comments) are 
> over 70, using inline assembly made it hard to read. Another reason 
> is, since we
> are using many registers (R8-R15, R[A-D]X) in TDVMCALL/TDCALL 
> operation, we are not sure whether some older compiler can follow
> our specified inline assembly constraints.

As for the justification, that's much improved.  Please include that,
along with some careful review of the grammar.

>>>>> +    movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>>>>> +
>>>>> +    tdcall
>>>>> +
>>>>> +    /* Panic if TDCALL reports failure. */
>>>>> +    test %rax, %rax
>>>>> +    jnz 2f
>>>>
>>>> Why panic?
>>> As per spec, TDCALL should never fail. Note that it has nothing to do
>>> with TDVMCALL function specific failure (which is reported via R10).
>>
>> You've introduced two concepts here, without differentiating them.  You
>> need to work to differentiate these two kinds of failure somewhere.  You
>> can't simply refer to both as "failure".
> will clarify it. I have assumed that once the user reads the spec, its
> easier
> to understand.

Your code should be 100% self-supporting without the spec.  The spec can
be there in a supportive role to help resolve ambiguity or add fine
detail.  But, I think this is a major, repeated problem with this patch
set: it relies too much on reviewers spending quality time with the spec.

>>>> Also, do you *REALLY* need to do this from assembly?  Can't it be done
>>>> in the C wrapper?
>>> Its common for all use cases of TDVMCALL (vendor specific, in/out, etc).
>>> so added
>>> it here.
>>
>> That's not a good reason.  You could just as easily have a C wrapper
>> which all uses of TDVMCALL go through.
> Any reason for not preferring it in assembly code?

Assembly is a last resort.  It should only be used for things that
simply can't be written in C or are horrific to understand and manage
when written in C.  A single statement like:

	BUG_ON(something);

does not qualify in my book as something that's horrific to write in C.
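A minimal sketch of that split, with hypothetical names and userspace stand-ins for BUG_ON() and the assembly stub:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's BUG_ON(). */
#define BUG_ON(cond) do { if (cond) { fprintf(stderr, "BUG\n"); abort(); } } while (0)

/* Stub for the bare assembly TDCALL helper (hypothetical name). */
static int64_t __tdcall_raw(uint64_t fn)
{
	(void)fn;
	return 0;	/* per the spec, TDCALL itself should never fail */
}

/*
 * C wrapper that every TDCALL goes through: the "should never happen"
 * check lives here, in C, instead of in the assembly stub.
 */
static int64_t tdcall(uint64_t fn)
{
	int64_t ret = __tdcall_raw(fn);

	BUG_ON(ret);	/* TDCALL-level failure: cannot continue */
	return ret;
}
```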

> Also, using wrapper will add more complication for in/out instruction
> substitution use case. please check the use case in following patch.
> https://github.com/intel/tdx/commit/1b73f60aa5bb93554f3b15cd786a9b10b53c1543

I'm seeing a repeated theme here.  The approach in this patch series,
and in this email thread in general appears to be one where the patch
submitter is doing as little work as possible and trying to make the
reviewer do as much work as possible.

This is a 300-line diff with all kinds of stuff going on in it.  I'm not
sure to what you are referring.  You haven't made it easy to figure out.

It would make it a lot easier if you pointed to a specific line, or
copied-and-pasted the code to which you refer.  I would really encourage
you to try to make your content easier for reviewers to digest:
Capitalize the start of sentences.  Make unambiguous references to code.
 Don't blindly cite the spec.  Fully express your thoughts.

You'll end up with happier reviewers instead of grumpy ones.

...
>>> More warnings at-least show that we are working
>>> with malicious VMM.
>>
>> That argument does not hold water for me.
>>
>> You can argue that a counter can be kept, or that a WARN_ON_ONCE() is
>> appropriate, or that a printk_ratelimited() is nice.  But, allowing an
>> untrusted software component to write unlimited warnings to the kernel
>> console is utterly nonsensical.
>>
>> By the same argument, any userspace exploit attempts could spew warnings
>> to the console also so that we can tell we are working with malicious
>> userspace.
> In our case, we will get WARN() output only if guest triggers
> TDCALL()/TDVMCALL()
> right? So getting WARN() message for failure of guest triggered call is
> justifiable right?

The output of these calls and thus the error code comes from another
piece of software, either the SEAM module or the VMM.

The error can be from one of several reasons:
 1. Guest error/bug where the guest provides a bad value.  This is
    probably the most likely scenario.  But, it's impossible to
    differentiate from the other cases because it's a guest bug.
 2. SEAM error/bug.  If the spec says "SEAM will not do this", then you
    can probably justify a WARN_ON_ONCE().  If the call is security-
    sensitive, like part of attestation, then you can't meaningfully
    recover from it and it probably deserves a BUG_ON().
 3. VMM error/bug/malice.  Again, you might be able to justify a
    WARN_ON_ONCE().  We do that for userspace that might be attacking
    the kernel.  These are *NEVER* fatal and should be rate-limited.

I don't see *ANYWHERE* in this list where an unbounded, unratelimited
WARN() is appropriate.  But, that's just my $0.02.
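The rate-limiting option from case 3 can be illustrated with a toy stand-in for printk_ratelimited(); the real kernel helper uses a struct ratelimit_state with a time window, whereas this sketch only counts a burst budget:

```c
#include <stdio.h>

/*
 * Toy stand-in for printk_ratelimited(): emit at most 'budget'
 * messages and silently drop the rest.  Returns 1 if the message
 * was emitted, 0 if it was dropped.
 */
static int warn_ratelimited(int *budget, const char *msg)
{
	if (*budget <= 0)
		return 0;	/* dropped: an untrusted VMM can't flood the log */
	(*budget)--;
	fprintf(stderr, "%s\n", msg);
	return 1;
}
```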

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-20 23:12                                 ` Kuppuswamy, Sathyanarayanan
  2021-04-20 23:42                                   ` Dave Hansen
@ 2021-04-20 23:53                                   ` Dan Williams
  2021-04-20 23:59                                     ` Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 161+ messages in thread
From: Dan Williams @ 2021-04-20 23:53 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

On Tue, Apr 20, 2021 at 4:12 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
[..]
> >>> Also, do you *REALLY* need to do this from assembly?  Can't it be done
> >>> in the C wrapper?
> >> Its common for all use cases of TDVMCALL (vendor specific, in/out, etc).
> >> so added
> >> it here.
> >

Can I ask a favor?

Please put a line break between quoted lines and your reply.

> > That's not a good reason.  You could just as easily have a C wrapper
> > which all uses of TDVMCALL go through.

...because this runs together when reading otherwise.

> Any reason for not preferring it in assembly code?
> Also, using wrapper will add more complication for in/out instruction
> substitution use case. please check the use case in following patch.
> https://github.com/intel/tdx/commit/1b73f60aa5bb93554f3b15cd786a9b10b53c1543

This commit still has open coded assembly for the TDVMCALL? I thought
we talked about it being unified with the common definition, or has
this patch not been reworked with that feedback yet? I expect there is
no performance reason why in/out need to get their own custom coded
TDVMCALL implementation. It should also be the case that failure behaves
the same as native in/out failure, i.e. all ones on read failure and
silent drops on write failure.
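Those failure semantics might be sketched like this; tdvmcall_io() is a hypothetical stand-in for the real hypercall, stubbed here to always fail:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hypercall stub; pretend the VMM rejects every request. */
static bool tdvmcall_io(uint16_t port, uint8_t *value, bool write)
{
	(void)port; (void)value; (void)write;
	return false;
}

/* A failed port read behaves like native hardware: all ones. */
static uint8_t tdg_inb(uint16_t port)
{
	uint8_t value;

	if (!tdvmcall_io(port, &value, false))
		return 0xff;
	return value;
}

/* A failed port write is silently dropped, matching native out. */
static void tdg_outb(uint8_t value, uint16_t port)
{
	(void)tdvmcall_io(port, &value, true);
}
```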


* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-20 23:53                                   ` Dan Williams
@ 2021-04-20 23:59                                     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-20 23:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List



On 4/20/21 4:53 PM, Dan Williams wrote:
> On Tue, Apr 20, 2021 at 4:12 PM Kuppuswamy, Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> [..]
>>>>> Also, do you *REALLY* need to do this from assembly?  Can't it be done
>>>>> in the C wrapper?
>>>> Its common for all use cases of TDVMCALL (vendor specific, in/out, etc).
>>>> so added
>>>> it here.
>>>
> 
> Can I ask a favor?
> 
> Please put a line break between quoted lines and your reply.

will do

> 
>>> That's not a good reason.  You could just as easily have a C wrapper
>>> which all uses of TDVMCALL go through.
> 
> ...because this runs together when reading otherwise.
> 
>> Any reason for not preferring it in assembly code?
>> Also, using wrapper will add more complication for in/out instruction
>> substitution use case. please check the use case in following patch.
>> https://github.com/intel/tdx/commit/1b73f60aa5bb93554f3b15cd786a9b10b53c1543
> 
> This commit still has open coded assembly for the TDVMCALL? I thought
> we talked about it being unified with the common definition, or has
> this patch not been reworked with that feedback yet? I expect there is
> no performance reason why in/out need to get their own custom coded
> TDVMCALL implementation. It should also be the case the failure should
> behave the same as native in/out failure i.e. all ones on read
> failure, and silent drops on write failure.
> 

That link is for an older version. My next version addresses your review
comments (it re-uses the TDVMCALL() function). Although the patch is ready,
I am waiting to fix other review comments before sending the next version.
I shared that link only to explain the use case.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-20 23:42                                   ` Dave Hansen
@ 2021-04-23  1:09                                     ` Kuppuswamy, Sathyanarayanan
  2021-04-23  1:21                                       ` Dave Hansen
  0 siblings, 1 reply; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-23  1:09 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel



On 4/20/21 4:42 PM, Dave Hansen wrote:
> On 4/20/21 4:12 PM, Kuppuswamy, Sathyanarayanan wrote:
>> On 4/20/21 12:59 PM, Dave Hansen wrote:
>>> On 4/20/21 12:20 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>>>> approach is, it adds a few extra instructions for every
>>>>>> TDCALL use case when compared to distributed checks. Although
>>>>>> it's a bit less efficient, it's worth it to make the code more
>>>>>> readable.
>>>>>
>>>>> What's a "distributed check"?
>>>>
>>>> It should be "distributed TDVMCALL/TDCALL inline assembly calls"
>>>
>>> It's still not clear to what that refers.
>>
>> I am just comparing the performance cost of using generic
>> TDCALL()/TDVMCALL() function implementation with "usage specific"
>> (GetQuote,MapGPA, HLT,etc) custom TDCALL()/TDVMCALL() inline assembly
>> implementation.
> 
> So, I actually had an idea what you were talking about, but I have
> *ZERO* idea what "distributed" means in this context.
> 
> I think you are trying to say something along the lines of:
> 
> 	Just like syscalls, not all TDVMCALL/TDCALLs use the same set
> 	of argument registers.  The implementation here picks the
> 	current worst-case scenario for TDCALL (4 registers).  For
> 	TDCALLs with fewer than 4 arguments, there will end up being
> 	a few superfluous (cheap) instructions.  But, this approach
> 	maximizes code reuse.
> 

Yes, you are correct. I will word it better in my next version.

> 
>>>>> This also doesn't talk at all about why this approach was
>>>>> chosen versus inline assembly.  You're going to be asked "why
>>>>> not use inline asm?"
>>>> "To make the code more readable and less error prone." I have
>>>> added this info in above paragraph. Do you think we need more
>>>> argument to justify our approach?
>>>
>>> Yes, you need much more justification.  That's pretty generic and
>>> non-specific.
>> readability is one of the main motivation for not choosing inline
> 
> I'm curious.  Is there a reason you are not choosing to use
> capitalization in your replies?  I personally use capitalization as a
> visual clue for where a reply starts.
> 
> I'm not sure whether this indicates that your keyboard is not
> functioning properly, or that these replies are simply not important
> enough to warrant the use of the Shift key.  Or, is it simply an
> oversight?  Or, maybe I'm just being overly picky because I've been
> working on these exact things with my third-grader a bit too much lately.
> 
> Either way, I personally would appreciate your attention to detail in
> crafting writing that is easy to parse, since I'm the one that's going
> to have to parse it.  Details show that you care about the content you
> produce.  Even if you don't mean it, a lack of attention to detail (even
> capital letters) can be perceived to mean that you do not care about
> what you write.  If you don't care about it, why should the reader?
> 
>> assembly. Since number of lines of instructions (with comments) are
>> over 70, using inline assembly made it hard to read. Another reason
>> is, since we
>> are using many registers (R8-R15, R[A-D]X) in TDVMCALL/TDCALL
>> operation, we are not sure whether some older compiler can follow
>> our specified inline assembly constraints.
> 
> As for the justification, that's much improved.  Please include that,
> along with some careful review of the grammar.

It's an oversight from my end. I will keep it in mind in my future
replies.


> 
>>>>>> +    movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx


>>>
>>> You've introduced two concepts here, without differentiating them.  You
>>> need to work to differentiate these two kinds of failure somewhere.  You
>>> can't simply refer to both as "failure".
>> will clarify it. I have assumed that once the user reads the spec, its
>> easier
>> to understand.
> 
> Your code should be 100% self-supporting without the spec.  The spec can
> be there in a supportive role to help resolve ambiguity or add fine
> detail.  But, I think this is a major, repeated problem with this patch
> set: it relies too much on reviewers spending quality time with the spec.
> 

I will review the patch set again and add necessary comments to fix this gap.

>>>>> Also, do you *REALLY* need to do this from assembly?  Can't it be done
>>>>> in the C wrapper?
>>>> Its common for all use cases of TDVMCALL (vendor specific, in/out, etc).
>>>> so added
>>>> it here.
>>>
>>> That's not a good reason.  You could just as easily have a C wrapper
>>> which all uses of TDVMCALL go through.
>> Any reason for not preferring it in assembly code?
> 
> Assembly is a last resort.  It should only be used for things that
> simply can't be written in C or are horrific to understand and manage
> when written in C.  A single statement like:
> 
> 	BUG_ON(something);
> 
> does not qualify in my book as something that's horrific to write in C.
> 
>> Also, using wrapper will add more complication for in/out instruction
>> substitution use case. please check the use case in following patch.
>> https://github.com/intel/tdx/commit/1b73f60aa5bb93554f3b15cd786a9b10b53c1543
> 
> I'm seeing a repeated theme here.  The approach in this patch series,
> and in this email thread in general appears to be one where the patch
> submitter is doing as little work as possible and trying to make the
> reviewer do as much work as possible.
> 
> This is a 300-line diff with all kinds of stuff going on in it.  I'm not
> sure to what you are referring.  You haven't made it easy to figure out.

I pointed to that patch as a reference for how in/out instructions
are substituted with tdvmcalls(). The specific implementation is spread
across multiple lines/files in that patch, so I did not include specific
line numbers.

But let me try to explain it here. What I meant by complication is that,
for the in/out instructions, we use alternative_io() to substitute the
in/out instructions with tdg_in()/tdg_out() assembly calls. So we have to
ensure that the substituted instructions don't corrupt registers or the
stack.

If you check the implementation of tdg_in()/tdg_out(), you will notice
that we have added code to preserve the caller's registers. So, if we use
a C wrapper for this use case, there is a chance that it might mess up
the caller's registers or stack.

	alternative_io("in" #bwl " %w2, %" #bw "0",			\
			"call tdg_in" #bwl, X86_FEATURE_TDX_GUEST,	\
			"=a"(value), "d"(port))

> 
> It would make it a lot easier if you pointed to a specific line, or
> copied-and-pasted the code to which you refer.  I would really encourage
> you to try to make your content easier for reviewers to digest:
> Capitalize the start of sentences.  Make unambiguous references to code.
>   Don't blindly cite the spec.  Fully express your thoughts.
> 
> You'll end up with happier reviewers instead of grumpy ones.

Got it. I will try to keep your suggestion in mind for future
communications.

> 
> ...
>>>> More warnings at-least show that we are working
>>>> with malicious VMM.
>>>

>> In our case, we will get WARN() output only if guest triggers
>> TDCALL()/TDVMCALL()
>> right? So getting WARN() message for failure of guest triggered call is
>> justifiable right?
> 
> The output of these calls and thus the error code comes from another
> piece of software, either the SEAM module or the VMM.
> 
> The error can be from one of several reasons:
>   1. Guest error/bug where the guest provides a bad value.  This is
>      probably the most likely scenario.  But, it's impossible to
>      differentiate from the other cases because it's a guest bug.
>   2. SEAM error/bug.  If the spec says "SEAM will not do this", then you
>      can probably justify a WARN_ON_ONCE().  If the call is security-
>      sensitive, like part of attestation, then you can't meaningfully
>      recover from it and it probably deserves a BUG_ON().
>   3. VMM error/bug/malice.  Again, you might be able to justify a
>      WARN_ON_ONCE().  We do that for userspace that might be attacking
>      the kernel.  These are *NEVER* fatal and should be rate-limited.
> 
> I don't see *ANYWHERE* in this list where an unbounded, unratelimited
> WARN() is appropriate.  But, that's just my $0.02.

WARN_ON_ONCE() will not work for our use case: since tdvmcall()/tdcall()
can be triggered from multiple use cases, we can't print errors only
once.

I will go with pr_warn_ratelimited() for this use case.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-23  1:09                                     ` Kuppuswamy, Sathyanarayanan
@ 2021-04-23  1:21                                       ` Dave Hansen
  2021-04-23  1:35                                         ` Andi Kleen
  0 siblings, 1 reply; 161+ messages in thread
From: Dave Hansen @ 2021-04-23  1:21 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 4/22/21 6:09 PM, Kuppuswamy, Sathyanarayanan wrote:
> But let me try to explain it here. What I meant by complication is,
> for in/out instruction, we use alternative_io() to substitute in/out
> instructions with tdg_in()/tdg_out() assembly calls. So we have to ensure
> that we don't corrupt registers or stack from the substituted instructions
> 
> If you check the implementation of tdg_in()/tdg_out(), you will notice
> that we have added code to preserve the caller registers. So, if we use
> C wrapper for this use case, there is a chance that it might mess
> the caller registers or stack.
> 
>     alternative_io("in" #bwl " %w2, %" #bw "0",            \
>             "call tdg_in" #bwl, X86_FEATURE_TDX_GUEST,    \
>             "=a"(value), "d"(port))

Are you saying that calling C functions from inline assembly might
corrupt the stack or registers?  Are you suggesting that you simply
can't call C functions from inline assembly?  Or, that you can't express
the register clobbers of a function call in inline assembly?

You might want to check around the kernel to see how other folks do it.


* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-23  1:21                                       ` Dave Hansen
@ 2021-04-23  1:35                                         ` Andi Kleen
  2021-04-23 15:15                                           ` Sean Christopherson
  0 siblings, 1 reply; 161+ messages in thread
From: Andi Kleen @ 2021-04-23  1:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel

On Thu, Apr 22, 2021 at 06:21:07PM -0700, Dave Hansen wrote:
> On 4/22/21 6:09 PM, Kuppuswamy, Sathyanarayanan wrote:
> > But let me try to explain it here. What I meant by complication is,
> > for in/out instruction, we use alternative_io() to substitute in/out
> > instructions with tdg_in()/tdg_out() assembly calls. So we have to ensure
> > that we don't corrupt registers or stack from the substituted instructions
> > 
> > If you check the implementation of tdg_in()/tdg_out(), you will notice
> > that we have added code to preserve the caller registers. So, if we use
> > C wrapper for this use case, there is a chance that it might mess
> > the caller registers or stack.
> > 
> >     alternative_io("in" #bwl " %w2, %" #bw "0",            \
> >             "call tdg_in" #bwl, X86_FEATURE_TDX_GUEST,    \
> >             "=a"(value), "d"(port))
> 
> Are you saying that calling C functions from inline assembly might
> corrupt the stack or registers?  Are you suggesting that you simply

It's possible, but you would need to mark a lot more registers as
clobbered (the x86-64 ABI allows many registers to be clobbered).

I don't think the stack would be messed up, but there might be problems
with writing the correct unwind information (which tends to be tricky).

Usually it's better to avoid it.

-Andi


> can't call C functions from inline assembly?  Or, that you can't express
> the register clobbers of a function call in inline assembly?
> 
> You might want to check around the kernel to see how other folks do it.


* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-23  1:35                                         ` Andi Kleen
@ 2021-04-23 15:15                                           ` Sean Christopherson
  2021-04-23 15:28                                             ` Dan Williams
                                                               ` (2 more replies)
  0 siblings, 3 replies; 161+ messages in thread
From: Sean Christopherson @ 2021-04-23 15:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kuppuswamy, Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

On Thu, Apr 22, 2021, Andi Kleen wrote:
> On Thu, Apr 22, 2021 at 06:21:07PM -0700, Dave Hansen wrote:
> > On 4/22/21 6:09 PM, Kuppuswamy, Sathyanarayanan wrote:
> > > But let me try to explain it here. What I meant by complication is,
> > > for in/out instruction, we use alternative_io() to substitute in/out
> > > instructions with tdg_in()/tdg_out() assembly calls. So we have to ensure
> > > that we don't corrupt registers or stack from the substituted instructions
> > > 
> > > If you check the implementation of tdg_in()/tdg_out(), you will notice
> > > that we have added code to preserve the caller registers. So, if we use
> > > C wrapper for this use case, there is a chance that it might mess
> > > the caller registers or stack.
> > > 
> > >     alternative_io("in" #bwl " %w2, %" #bw "0",            \
> > >             "call tdg_in" #bwl, X86_FEATURE_TDX_GUEST,    \

Has Intel "officially" switched to "tdg" as the acronym for TDX guest?  As much
as I dislike having to juggle "TDX host" vs "TDX guest" concepts, tdx_ vs tdg_
isn't any better IMO.  The latter looks an awful lot like a typo, grepping for
"tdx" to find relevant code will fail (sometimes), and confusion seems
inevitable as keeping "TDX" out of guest code/comments/documentation will be
nigh impossible.

If we do decide to go with "tdg" for the guest stuff, then _all_ of the guest
stuff, file names included, should use tdg.  Maybe X86_FEATURE_TDX_GUEST could
be left as a breadcrumb for translating TDX->TDG.

> > >             "=a"(value), "d"(port))
> > 
> > Are you saying that calling C functions from inline assembly might
> > corrupt the stack or registers?  Are you suggesting that you simply
> 
> It's possible, but you would need to mark a lot more registers clobbered
> (the x86-64 ABI allows to clobber many registers)
> 
> I don't think the stack would be messed up, but there might be problems
> with writing the correct unwind information (which tends to be tricky)
> 
> Usually it's better to avoid it.

For me, the more important justification is that, if calling from alternative_io,
the input parameters will be in the wrong registers.  The OUT wrapper would be
especially gross as RAX (the value to write) isn't an input param, i.e. shifting
via "ignored" params wouldn't work.

But to Dave's point, that justification needs to be in the changelog.


* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-23 15:15                                           ` Sean Christopherson
@ 2021-04-23 15:28                                             ` Dan Williams
  2021-04-23 15:38                                               ` Andi Kleen
  2021-04-23 15:50                                               ` Sean Christopherson
  2021-04-23 15:47                                             ` Andi Kleen
  2021-04-23 18:18                                             ` Kuppuswamy, Sathyanarayanan
  2 siblings, 2 replies; 161+ messages in thread
From: Dan Williams @ 2021-04-23 15:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andi Kleen, Dave Hansen, Kuppuswamy, Sathyanarayanan,
	Peter Zijlstra, Andy Lutomirski, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Linux Kernel Mailing List

On Fri, Apr 23, 2021 at 8:15 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Apr 22, 2021, Andi Kleen wrote:
> > On Thu, Apr 22, 2021 at 06:21:07PM -0700, Dave Hansen wrote:
> > > On 4/22/21 6:09 PM, Kuppuswamy, Sathyanarayanan wrote:
> > > > But let me try to explain it here. What I meant by complication is,
> > > > for in/out instruction, we use alternative_io() to substitute in/out
> > > > instructions with tdg_in()/tdg_out() assembly calls. So we have to ensure
> > > > that we don't corrupt registers or stack from the substituted instructions
> > > >
> > > > If you check the implementation of tdg_in()/tdg_out(), you will notice
> > > > that we have added code to preserve the caller registers. So, if we use
> > > > C wrapper for this use case, there is a chance that it might mess
> > > > the caller registers or stack.
> > > >
> > > >     alternative_io("in" #bwl " %w2, %" #bw "0",            \
> > > >             "call tdg_in" #bwl, X86_FEATURE_TDX_GUEST,    \
>
> Has Intel "officially" switched to "tdg" as the acronym for TDX guest?  As much
> as I dislike having to juggle "TDX host" vs "TDX guest" concepts, tdx_ vs tdg_
> isn't any better IMO.  The latter looks an awful lot like a typo, grepping for
> "tdx" to find relevant code will fail (sometimes), and confusion seems
> inevitable as keeping "TDX" out of guest code/comments/documentation will be
> nigh impossible.
>
> If we do decide to go with "tdg" for the guest stuff, then _all_ of the guest
> stuff, file names included, should use tdg.  Maybe X86_FEATURE_TDX_GUEST could
> be left as a breadcrumb for translating TDX->TDG.
>
> > > >             "=a"(value), "d"(port))
> > >
> > > Are you saying that calling C functions from inline assembly might
> > > corrupt the stack or registers?  Are you suggesting that you simply
> >
> > It's possible, but you would need to mark a lot more registers clobbered
> > (the x86-64 ABI allows to clobber many registers)
> >
> > I don't think the stack would be messed up, but there might be problems
> > with writing the correct unwind information (which tends to be tricky)
> >
> > Usually it's better to avoid it.
>
> For me, the more important justification is that, if calling from alternative_io,
> the input parameters will be in the wrong registers.  The OUT wrapper would be
> especially gross as RAX (the value to write) isn't an input param, i.e. shifting
> via "ignored" params wouldn't work.
>
> But to Dave's point, that justification needs to be in the changelog.

It's not clear to me that in()/out() need to be inline asm with an
alternative vs out-of-line function calls with a static branch?
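Dan's out-of-line alternative might look roughly like the following; a plain boolean stands in for the kernel's static-key machinery, and both I/O paths are stubbed:

```c
#include <stdbool.h>
#include <stdint.h>

static bool tdx_guest;	/* kernel: DEFINE_STATIC_KEY_FALSE(tdx_guest) */

static uint8_t native_inb(uint16_t port) { (void)port; return 0xaa; }	/* stub */
static uint8_t tdg_inb(uint16_t port)    { (void)port; return 0xbb; }	/* stub */

/* Out-of-line dispatch instead of patching the in instruction inline. */
static uint8_t port_inb(uint16_t port)
{
	if (tdx_guest)	/* kernel: static_branch_unlikely(&tdx_guest) */
		return tdg_inb(port);
	return native_inb(port);
}
```

With a static branch, the `if` compiles down to a patched jump, so the non-TDX path pays (almost) nothing at runtime.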


* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-23 15:28                                             ` Dan Williams
@ 2021-04-23 15:38                                               ` Andi Kleen
  2021-04-23 15:50                                               ` Sean Christopherson
  1 sibling, 0 replies; 161+ messages in thread
From: Andi Kleen @ 2021-04-23 15:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: Sean Christopherson, Dave Hansen, Kuppuswamy, Sathyanarayanan,
	Peter Zijlstra, Andy Lutomirski, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Linux Kernel Mailing List

On Fri, Apr 23, 2021 at 08:28:45AM -0700, Dan Williams wrote:
> On Fri, Apr 23, 2021 at 8:15 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Apr 22, 2021, Andi Kleen wrote:
> > > On Thu, Apr 22, 2021 at 06:21:07PM -0700, Dave Hansen wrote:
> > > > On 4/22/21 6:09 PM, Kuppuswamy, Sathyanarayanan wrote:
> > > > > But let me try to explain it here. What I meant by complication is,
> > > > > for in/out instruction, we use alternative_io() to substitute in/out
> > > > > instructions with tdg_in()/tdg_out() assembly calls. So we have to ensure
> > > > > that we don't corrupt registers or stack from the substituted instructions
> > > > >
> > > > > If you check the implementation of tdg_in()/tdg_out(), you will notice
> > > > > that we have added code to preserve the caller registers. So, if we use
> > > > > C wrapper for this use case, there is a chance that it might mess
> > > > > the caller registers or stack.
> > > > >
> > > > >     alternative_io("in" #bwl " %w2, %" #bw "0",            \
> > > > >             "call tdg_in" #bwl, X86_FEATURE_TDX_GUEST,    \
> >
> > Has Intel "officially" switched to "tdg" as the acronym for TDX guest?  As much
> > as I dislike having to juggle "TDX host" vs "TDX guest" concepts, tdx_ vs tdg_
> > isn't any better IMO.  The latter looks an awful lot like a typo, grepping for
> > "tdx" to find relevant code will fail (sometimes), and confusion seems
> > inevitable as keeping "TDX" out of guest code/comments/documentation will be
> > nigh impossible.
> >
> > If we do decide to go with "tdg" for the guest stuff, then _all_ of the guest
> > stuff, file names included, should use tdg.  Maybe X86_FEATURE_TDX_GUEST could
> > be left as a breadcrumb for translating TDX->TDG.
> >
> > > > >             "=a"(value), "d"(port))
> > > >
> > > > Are you saying that calling C functions from inline assembly might
> > > > corrupt the stack or registers?  Are you suggesting that you simply
> > >
> > > It's possible, but you would need to mark a lot more registers clobbered
> > > (the x86-64 ABI allows the callee to clobber many registers)
> > >
> > > I don't think the stack would be messed up, but there might be problems
> > > with writing the correct unwind information (which tends to be tricky)
> > >
> > > Usually it's better to avoid it.
> >
> > For me, the more important justification is that, if calling from alternative_io,
> > the input parameters will be in the wrong registers.  The OUT wrapper would be
> > especially gross as RAX (the value to write) isn't an input param, i.e. shifting
> > via "ignored" params wouldn't work.
> >
> > But to Dave's point, that justification needs to be in the changelog.
> 
> It's not clear to me that in()/out() need to be inline asm with an
> alternative vs out-of-line function calls with a static branch?

I doubt it matters at all on a modern machine (the cost of an IO port
access is many orders of magnitude greater than a call), but it might
have mattered on really old systems, like 486 class. Maybe if someone
is still running those, moving it out of line could be a problem.

But likely it's fine.

I think actually for the main kernel we could just rely on #VE here
and drop it all.
Doing it without #VE only really matters for the old boot code, where
performance doesn't really matter.

-Andi

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-23 15:15                                           ` Sean Christopherson
  2021-04-23 15:28                                             ` Dan Williams
@ 2021-04-23 15:47                                             ` Andi Kleen
  2021-04-23 18:18                                             ` Kuppuswamy, Sathyanarayanan
  2 siblings, 0 replies; 161+ messages in thread
From: Andi Kleen @ 2021-04-23 15:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Kuppuswamy, Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

> 
> Has Intel "officially" switched to "tdg" as the acronym for TDX guest?  As much

Just for the global symbols to avoid conflicts with the tdx host code.

-Andi


* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-23 15:28                                             ` Dan Williams
  2021-04-23 15:38                                               ` Andi Kleen
@ 2021-04-23 15:50                                               ` Sean Christopherson
  1 sibling, 0 replies; 161+ messages in thread
From: Sean Christopherson @ 2021-04-23 15:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andi Kleen, Dave Hansen, Kuppuswamy, Sathyanarayanan,
	Peter Zijlstra, Andy Lutomirski, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Linux Kernel Mailing List

On Fri, Apr 23, 2021, Dan Williams wrote:
> On Fri, Apr 23, 2021 at 8:15 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Apr 22, 2021, Andi Kleen wrote:
> > > On Thu, Apr 22, 2021 at 06:21:07PM -0700, Dave Hansen wrote:
> > > > On 4/22/21 6:09 PM, Kuppuswamy, Sathyanarayanan wrote:
> > > > > But let me try to explain it here. What I meant by complication is,
> > > > > for in/out instructions, we use alternative_io() to substitute in/out
> > > > > instructions with tdg_in()/tdg_out() assembly calls. So we have to ensure
> > > > > that we don't corrupt registers or the stack from the substituted instructions.
> > > > >
> > > > > If you check the implementation of tdg_in()/tdg_out(), you will notice
> > > > > that we have added code to preserve the caller's registers. So, if we use
> > > > > a C wrapper for this use case, there is a chance that it might mess up
> > > > > the caller's registers or stack.
> > > > >
> > > > >     alternative_io("in" #bwl " %w2, %" #bw "0",            \
> > > > >             "call tdg_in" #bwl, X86_FEATURE_TDX_GUEST,    \
> >
> > Has Intel "officially" switched to "tdg" as the acronym for TDX guest?  As much
> > as I dislike having to juggle "TDX host" vs "TDX guest" concepts, tdx_ vs tdg_
> > isn't any better IMO.  The latter looks an awful lot like a typo, grepping for
> > "tdx" to find relevant code will get fail (sometimes), and confusion seems
> > inevitable as keeping "TDX" out of guest code/comments/documentation will be
> > nigh impossible.
> >
> > If we do decide to go with "tdg" for the guest stuff, then _all_ of the guest
> > stuff, file names included, should use tdg.  Maybe X86_FEATURE_TDX_GUEST could
> > be left as a breadcrumb for translating TDX->TDG.
> >
> > > > >             "=a"(value), "d"(port))
> > > >
> > > > Are you saying that calling C functions from inline assembly might
> > > > corrupt the stack or registers?  Are you suggesting that you simply
> > >
> > > It's possible, but you would need to mark a lot more registers clobbered
> > > (the x86-64 ABI allows the callee to clobber many registers)
> > >
> > > I don't think the stack would be messed up, but there might be problems
> > > with writing the correct unwind information (which tends to be tricky)
> > >
> > > Usually it's better to avoid it.
> >
> > For me, the more important justification is that, if calling from alternative_io,
> > the input parameters will be in the wrong registers.  The OUT wrapper would be
> > especially gross as RAX (the value to write) isn't an input param, i.e. shifting
> > via "ignored" params wouldn't work.
> >
> > But to Dave's point, that justification needs to be in the changelog.
> 
> It's not clear to me that in()/out() need to be inline asm with an
> alternative vs out-of-line function calls with a static branch?

Code footprint is the main argument, especially since Intel presumably is hoping
most distros will build their generic kernels with CONFIG_TDX_GUEST=y.  IIRC, a
carefully crafted assembly helper with a custom ABI can keep the overhead to
three bytes (or maybe even one byte?) versus raw IN and OUT.  Using a static
branch would incur significantly more overhead since the register prologues would
be different.

Or are you saying make the inb/outb() helpers themselves functions?  That would
be quite brutal, too.  

On a related topic, this changelog also needs to justify changing "Nd" to "d".
Maybe no one cares too much about imm8 port I/O and code footprint, but
effectively killing off "IN AL, imm8" should at least be called out.


* Re: [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-23 15:15                                           ` Sean Christopherson
  2021-04-23 15:28                                             ` Dan Williams
  2021-04-23 15:47                                             ` Andi Kleen
@ 2021-04-23 18:18                                             ` Kuppuswamy, Sathyanarayanan
  2 siblings, 0 replies; 161+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-23 18:18 UTC (permalink / raw)
  To: Sean Christopherson, Andi Kleen
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel



On 4/23/21 8:15 AM, Sean Christopherson wrote:
> Has Intel "officially" switched to "tdg" as the acronym for TDX guest?  As much
> as I dislike having to juggle "TDX host" vs "TDX guest" concepts, tdx_ vs tdg_
> isn't any better IMO.  

When we merged both the host and guest kernel code into the same code base, we
hit some name conflicts (due to using the tdx_ prefix in both host and guest
code). So, in order to avoid such issues in the future, we decided to go with
the tdg/tdh combination. We thought it was good enough for kernel
function/variable names.

> The latter looks an awful lot like a typo, grepping for
> "tdx" to find relevant code will sometimes fail, and confusion seems
> inevitable as keeping "TDX" out of guest code/comments/documentation will be
> nigh impossible.

The tdg/tdh combination is only used within kernel code. In sections that are
visible to users (kernel config and command-line options), we still use the
tdx_guest/tdx_host combination.


> 
> If we do decide to go with "tdg" for the guest stuff, then _all_ of the guest
> stuff, file names included, should use tdg.  Maybe X86_FEATURE_TDX_GUEST could
> be left as a breadcrumb for translating TDX->TDG.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


end of thread, other threads:[~2021-04-23 18:18 UTC | newest]

Thread overview: 161+ messages
-- links below jump to the message on this page --
2021-02-06  3:02 Test Email sathyanarayanan.kuppuswamy
2021-02-05 23:38 ` [RFC v1 00/26] Add TDX Guest Support Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 01/26] x86/paravirt: Introduce CONFIG_PARAVIRT_XL Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 02/26] x86/cpufeatures: Add TDX Guest CPU feature Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 03/26] x86/cpufeatures: Add is_tdx_guest() interface Kuppuswamy Sathyanarayanan
2021-04-01 21:08   ` Dave Hansen
2021-04-01 21:15     ` Kuppuswamy, Sathyanarayanan
2021-04-01 21:19       ` Dave Hansen
2021-04-01 22:25         ` Kuppuswamy, Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 04/26] x86/tdx: Get TD execution environment information via TDINFO Kuppuswamy Sathyanarayanan
2021-02-08 10:00   ` Peter Zijlstra
2021-02-08 19:10     ` Kuppuswamy, Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 05/26] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
2021-02-08 10:20   ` Peter Zijlstra
2021-02-08 16:23     ` Andi Kleen
2021-02-08 16:33       ` Peter Zijlstra
2021-02-08 16:46         ` Sean Christopherson
2021-02-08 16:59           ` Peter Zijlstra
2021-02-08 19:05             ` Kuppuswamy, Sathyanarayanan
2021-02-08 16:46         ` Andi Kleen
2021-02-12 19:20   ` Dave Hansen
2021-02-12 19:47   ` Andy Lutomirski
2021-02-12 20:06     ` Sean Christopherson
2021-02-12 20:17       ` Dave Hansen
2021-02-12 20:37         ` Sean Christopherson
2021-02-12 20:46           ` Dave Hansen
2021-02-12 20:54             ` Sean Christopherson
2021-02-12 21:06               ` Dave Hansen
2021-02-12 21:37                 ` Sean Christopherson
2021-02-12 21:47                   ` Andy Lutomirski
2021-02-12 21:48                     ` Dave Hansen
2021-02-14 19:33                       ` Andi Kleen
2021-02-14 19:54                         ` Andy Lutomirski
2021-02-12 20:20       ` Andy Lutomirski
2021-02-12 20:44         ` Sean Christopherson
2021-02-05 23:38 ` [RFC v1 06/26] x86/tdx: Add HLT " Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 07/26] x86/tdx: Wire up KVM hypercalls Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 08/26] x86/tdx: Add MSR support for TDX guest Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 09/26] x86/tdx: Handle CPUID via #VE Kuppuswamy Sathyanarayanan
2021-02-05 23:42   ` Andy Lutomirski
2021-02-07 14:13     ` Kirill A. Shutemov
2021-02-07 16:01       ` Dave Hansen
2021-02-07 20:29         ` Kirill A. Shutemov
2021-02-07 22:31           ` Dave Hansen
2021-02-07 22:45             ` Andy Lutomirski
2021-02-08 17:10               ` Sean Christopherson
2021-02-08 17:35                 ` Andy Lutomirski
2021-02-08 17:47                   ` Sean Christopherson
2021-03-18 21:30               ` [PATCH v1 1/1] x86/tdx: Add tdcall() and tdvmcall() helper functions Kuppuswamy Sathyanarayanan
2021-03-19 16:55                 ` Sean Christopherson
2021-03-19 17:42                   ` Kuppuswamy, Sathyanarayanan
2021-03-19 18:22                     ` Dave Hansen
2021-03-19 19:58                       ` Kuppuswamy, Sathyanarayanan
2021-03-26 23:38                         ` [PATCH v2 1/1] x86/tdx: Add __tdcall() and __tdvmcall() " Kuppuswamy Sathyanarayanan
2021-04-20 17:36                           ` Dave Hansen
2021-04-20 19:20                             ` Kuppuswamy, Sathyanarayanan
2021-04-20 19:59                               ` Dave Hansen
2021-04-20 23:12                                 ` Kuppuswamy, Sathyanarayanan
2021-04-20 23:42                                   ` Dave Hansen
2021-04-23  1:09                                     ` Kuppuswamy, Sathyanarayanan
2021-04-23  1:21                                       ` Dave Hansen
2021-04-23  1:35                                         ` Andi Kleen
2021-04-23 15:15                                           ` Sean Christopherson
2021-04-23 15:28                                             ` Dan Williams
2021-04-23 15:38                                               ` Andi Kleen
2021-04-23 15:50                                               ` Sean Christopherson
2021-04-23 15:47                                             ` Andi Kleen
2021-04-23 18:18                                             ` Kuppuswamy, Sathyanarayanan
2021-04-20 23:53                                   ` Dan Williams
2021-04-20 23:59                                     ` Kuppuswamy, Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 10/26] x86/io: Allow to override inX() and outX() implementation Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 11/26] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
2021-04-01 19:56   ` Dave Hansen
2021-04-01 22:26     ` Sean Christopherson
2021-04-01 22:53       ` Dave Hansen
2021-02-05 23:38 ` [RFC v1 13/26] x86/tdx: Handle MWAIT, MONITOR and WBINVD Kuppuswamy Sathyanarayanan
2021-02-05 23:43   ` Andy Lutomirski
2021-02-05 23:54     ` Kuppuswamy, Sathyanarayanan
2021-02-06  1:05       ` Andy Lutomirski
2021-03-27  0:18         ` [PATCH v1 1/1] " Kuppuswamy Sathyanarayanan
2021-03-27  2:40           ` Andy Lutomirski
2021-03-27  3:40             ` Kuppuswamy, Sathyanarayanan
2021-03-27 16:03               ` Andy Lutomirski
2021-03-27 22:54                 ` [PATCH v2 " Kuppuswamy Sathyanarayanan
2021-03-29 17:14                   ` Dave Hansen
2021-03-29 21:55                     ` Kuppuswamy, Sathyanarayanan
2021-03-29 22:02                       ` Dave Hansen
2021-03-29 22:09                         ` Kuppuswamy, Sathyanarayanan
2021-03-29 22:12                           ` Dave Hansen
2021-03-29 22:42                             ` Kuppuswamy, Sathyanarayanan
2021-03-29 23:16                             ` [PATCH v3 " Kuppuswamy Sathyanarayanan
2021-03-29 23:23                               ` Andy Lutomirski
2021-03-29 23:37                                 ` Kuppuswamy, Sathyanarayanan
2021-03-29 23:42                                   ` Sean Christopherson
2021-03-29 23:58                                     ` Andy Lutomirski
2021-03-30  2:04                                       ` Andi Kleen
2021-03-30  2:58                                         ` Andy Lutomirski
2021-03-30 15:14                                           ` Sean Christopherson
2021-03-30 16:37                                             ` Andy Lutomirski
2021-03-30 16:57                                               ` Sean Christopherson
2021-04-07 15:24                                                 ` Andi Kleen
2021-03-31 21:09                                           ` [PATCH v4 " Kuppuswamy Sathyanarayanan
2021-03-31 21:49                                             ` Dave Hansen
2021-03-31 22:29                                               ` Kuppuswamy, Sathyanarayanan
2021-03-31 21:53                                             ` Sean Christopherson
2021-03-31 22:00                                               ` Dave Hansen
2021-03-31 22:06                                                 ` Sean Christopherson
2021-03-31 22:11                                                   ` Dave Hansen
2021-03-31 22:28                                                     ` Kuppuswamy, Sathyanarayanan
2021-03-31 22:32                                                       ` Sean Christopherson
2021-03-31 22:34                                                       ` Dave Hansen
2021-04-01  3:28                                                         ` Andi Kleen
2021-04-01  3:46                                                           ` Dave Hansen
2021-04-01  4:24                                                             ` Andi Kleen
2021-04-01  4:51                                                               ` [PATCH v5 " Kuppuswamy Sathyanarayanan
2021-03-29 23:39                                 ` [PATCH v3 " Sean Christopherson
2021-03-29 23:38                               ` Dave Hansen
2021-03-30  4:56           ` [PATCH v1 " Xiaoyao Li
2021-03-30 15:00             ` Andi Kleen
2021-03-30 15:10               ` Dave Hansen
2021-03-30 17:02                 ` Kuppuswamy, Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 14/26] ACPI: tables: Add multiprocessor wake-up support Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 15/26] x86/boot: Add a trampoline for APs booting in 64-bit mode Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 16/26] x86/boot: Avoid #VE during compressed boot for TDX platforms Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 17/26] x86/boot: Avoid unnecessary #VE during boot process Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 18/26] x86/topology: Disable CPU hotplug support for TDX platforms Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 19/26] x86/tdx: Forcefully disable legacy PIC for TDX guests Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 20/26] x86/tdx: Introduce INTEL_TDX_GUEST config option Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 21/26] x86/mm: Move force_dma_unencrypted() to common code Kuppuswamy Sathyanarayanan
2021-04-01 20:06   ` Dave Hansen
2021-04-06 15:37     ` Kirill A. Shutemov
2021-04-06 16:11       ` Dave Hansen
2021-04-06 16:37         ` Kirill A. Shutemov
2021-02-05 23:38 ` [RFC v1 22/26] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Kuppuswamy Sathyanarayanan
2021-04-01 20:13   ` Dave Hansen
2021-04-06 15:54     ` Kirill A. Shutemov
2021-04-06 16:12       ` Dave Hansen
2021-02-05 23:38 ` [RFC v1 23/26] x86/tdx: Make pages shared in ioremap() Kuppuswamy Sathyanarayanan
2021-04-01 20:26   ` Dave Hansen
2021-04-06 16:00     ` Kirill A. Shutemov
2021-04-06 16:14       ` Dave Hansen
2021-02-05 23:38 ` [RFC v1 24/26] x86/tdx: Add helper to do MapGPA TDVMALL Kuppuswamy Sathyanarayanan
2021-02-05 23:38 ` [RFC v1 25/26] x86/tdx: Make DMA pages shared Kuppuswamy Sathyanarayanan
2021-04-01 21:01   ` Dave Hansen
2021-04-06 16:31     ` Kirill A. Shutemov
2021-04-06 16:38       ` Dave Hansen
2021-04-06 17:16         ` Sean Christopherson
2021-02-05 23:38 ` [RFC v1 26/26] x86/kvm: Use bounce buffers for TD guest Kuppuswamy Sathyanarayanan
2021-04-01 21:17   ` Dave Hansen
2021-02-06  3:04 ` Test Email sathyanarayanan.kuppuswamy
2021-02-06  6:24 ` [RFC v1 00/26] Add TDX Guest Support sathyanarayanan.kuppuswamy
2021-03-31 21:38 ` Kuppuswamy, Sathyanarayanan
2021-04-02  0:02 ` Dave Hansen
2021-04-02  2:48   ` Andi Kleen
2021-04-02 15:27     ` Dave Hansen
2021-04-02 21:32       ` Andi Kleen
2021-04-03 16:26         ` Dave Hansen
2021-04-03 17:28           ` Andi Kleen
2021-04-04 15:02 ` Dave Hansen
2021-04-12 17:24   ` Dan Williams
