linux-kernel.vger.kernel.org archive mirror
* [RFC v2 00/32] Add TDX Guest Support
@ 2021-04-26 18:01 Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL Kuppuswamy Sathyanarayanan
                   ` (32 more replies)
  0 siblings, 33 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Hi All,

NOTE: This series is not ready for wide public review. It is being
specifically posted so that Peter Z and other experts on the entry
code can look for problems with the new exception handler (#VE).
That's also why x86@ is not being spammed.

Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. This series adds the bare-minimum
support to run a TDX guest. The host-side support will be submitted
separately. Support for advanced TD guest features like attestation or
debug mode will also be submitted separately. At this point the series is
not secure: there are known holes in drivers, and it has not yet been
fully audited and fuzzed.

TDX has a lot of similarities to SEV. It enhances the confidentiality
of guest memory and state (like registers) and includes a new exception
(#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
yet), TDX limits the host's ability to effect changes in the guest
physical address space.

In contrast to the SEV code in the kernel, TDX guest memory is integrity
protected and isolated; the host is prevented from accessing guest
memory (even ciphertext).

The TDX architecture also includes a new CPU mode called
Secure-Arbitration Mode (SEAM). The software (TDX module) running in this
mode arbitrates interactions between host and guest and implements many of
the guarantees of the TDX architecture.

Some of the key differences between a TD and a regular VM are:

1. Multi CPU bring-up is done using the ACPI MADT wake-up table.
2. A new #VE exception handler is added. The TDX module injects a #VE
   exception into the guest TD for instructions that need to be emulated,
   disallowed MSR accesses, a subset of CPUID leaves, etc.
3. By default, memory is marked as private, and the TD selectively shares it
   with the VMM as needed.
4. Remote attestation is supported, enabling a third party (either the owner of
   the workload or a user of the services the workload provides) to establish
   that the workload is running inside a TD on an Intel-TDX-enabled platform
   before providing it with workload data.

TDX-related documents can be found at the following link:

https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html

Changes since v1:
 * Implemented tdcall() and tdvmcall() helper functions in assembly and renamed
   them to __tdcall() and __tdvmcall().
 * Added do_general_protection() helper function to re-use protection
   code between #GP exception and TDX #VE exception handlers.
 * Addressed syscall gap issue in #VE handler support (for details check
   the commit log in "x86/traps: Add #VE support for TDX guest").
 * Modified patch titled "x86/tdx: Handle port I/O" to re-use common
   tdvmcall() helper function.
 * Added error handling support to MADT CPU wakeup code.
 * Introduced enum tdx_map_type to identify SHARED vs PRIVATE memory type.
 * Enabled shared memory in IOAPIC driver.
 * Added BINUTILS version info for TDCALL.
 * Changed the TDVMCALL vendor id from 0 to "TDX.KVM".
 * Replaced WARN() with pr_warn_ratelimited() in __tdvmcall() wrappers.
 * Addressed review comments on commit logs and code comments.
 * Renamed the patch titled "x86/topology: Disable CPU hotplug support for TDX
   platforms" to "x86/topology: Disable CPU online/offline control for
   TDX guest".
 * Rebased on top of v5.12 kernel.


Erik Kaneda (1):
  ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure

Isaku Yamahata (1):
  x86/tdx: ioapic: Add shared bit for IOAPIC base address

Kirill A. Shutemov (16):
  x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  x86/tdx: Get TD execution environment information via TDINFO
  x86/traps: Add #VE support for TDX guest
  x86/tdx: Add HLT support for TDX guest
  x86/tdx: Wire up KVM hypercalls
  x86/tdx: Add MSR support for TDX guest
  x86/tdx: Handle CPUID via #VE
  x86/io: Allow to override inX() and outX() implementation
  x86/tdx: Handle port I/O
  x86/tdx: Handle in-kernel MMIO
  x86/mm: Move force_dma_unencrypted() to common code
  x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  x86/tdx: Make pages shared in ioremap()
  x86/tdx: Add helper to do MapGPA TDVMCALL
  x86/tdx: Make DMA pages shared
  x86/kvm: Use bounce buffers for TD guest

Kuppuswamy Sathyanarayanan (10):
  x86/tdx: Introduce INTEL_TDX_GUEST config option
  x86/cpufeatures: Add TDX Guest CPU feature
  x86/x86: Add is_tdx_guest() interface
  x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  x86/traps: Add do_general_protection() helper function
  x86/tdx: Handle MWAIT, MONITOR and WBINVD
  ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure
  ACPI/table: Print MADT Wake table information
  x86/acpi, x86/boot: Add multiprocessor wake-up support
  x86/topology: Disable CPU online/offline control for TDX guest

Sean Christopherson (4):
  x86/boot: Add a trampoline for APs booting in 64-bit mode
  x86/boot: Avoid #VE during compressed boot for TDX platforms
  x86/boot: Avoid unnecessary #VE during boot process
  x86/tdx: Forcefully disable legacy PIC for TDX guests

 arch/x86/Kconfig                         |  28 +-
 arch/x86/boot/compressed/Makefile        |   2 +
 arch/x86/boot/compressed/head_64.S       |  10 +-
 arch/x86/boot/compressed/misc.h          |   1 +
 arch/x86/boot/compressed/pgtable.h       |   2 +-
 arch/x86/boot/compressed/tdcall.S        |   9 +
 arch/x86/boot/compressed/tdx.c           |  32 ++
 arch/x86/include/asm/apic.h              |   3 +
 arch/x86/include/asm/cpufeatures.h       |   1 +
 arch/x86/include/asm/idtentry.h          |   4 +
 arch/x86/include/asm/io.h                |  24 +-
 arch/x86/include/asm/irqflags.h          |  38 +-
 arch/x86/include/asm/kvm_para.h          |  21 +
 arch/x86/include/asm/paravirt.h          |  22 +-
 arch/x86/include/asm/paravirt_types.h    |   3 +-
 arch/x86/include/asm/pgtable.h           |   3 +
 arch/x86/include/asm/realmode.h          |   1 +
 arch/x86/include/asm/tdx.h               | 176 +++++++++
 arch/x86/kernel/Makefile                 |   1 +
 arch/x86/kernel/acpi/boot.c              |  79 ++++
 arch/x86/kernel/apic/apic.c              |   8 +
 arch/x86/kernel/apic/io_apic.c           |  12 +-
 arch/x86/kernel/asm-offsets.c            |  22 ++
 arch/x86/kernel/head64.c                 |   3 +
 arch/x86/kernel/head_64.S                |  13 +-
 arch/x86/kernel/idt.c                    |   6 +
 arch/x86/kernel/paravirt.c               |   4 +-
 arch/x86/kernel/pci-swiotlb.c            |   2 +-
 arch/x86/kernel/smpboot.c                |   5 +
 arch/x86/kernel/tdcall.S                 | 361 +++++++++++++++++
 arch/x86/kernel/tdx-kvm.c                |  45 +++
 arch/x86/kernel/tdx.c                    | 480 +++++++++++++++++++++++
 arch/x86/kernel/topology.c               |   3 +-
 arch/x86/kernel/traps.c                  |  81 ++--
 arch/x86/mm/Makefile                     |   2 +
 arch/x86/mm/ioremap.c                    |   8 +-
 arch/x86/mm/mem_encrypt.c                |  75 ----
 arch/x86/mm/mem_encrypt_common.c         |  85 ++++
 arch/x86/mm/mem_encrypt_identity.c       |   1 +
 arch/x86/mm/pat/set_memory.c             |  48 ++-
 arch/x86/realmode/rm/header.S            |   1 +
 arch/x86/realmode/rm/trampoline_64.S     |  49 ++-
 arch/x86/realmode/rm/trampoline_common.S |   5 +-
 drivers/acpi/tables.c                    |  11 +
 include/acpi/actbl2.h                    |  26 +-
 45 files changed, 1654 insertions(+), 162 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdcall.S
 create mode 100644 arch/x86/boot/compressed/tdx.c
 create mode 100644 arch/x86/include/asm/tdx.h
 create mode 100644 arch/x86/kernel/tdcall.S
 create mode 100644 arch/x86/kernel/tdx-kvm.c
 create mode 100644 arch/x86/kernel/tdx.c
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-27 17:31   ` Borislav Petkov
  2021-04-26 18:01 ` [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option Kuppuswamy Sathyanarayanan
                   ` (31 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Split off the halt paravirt calls from CONFIG_PARAVIRT_XXL into
a separate config option. It provides a middle ground for
not-so-deeply paravirtualized environments.

CONFIG_PARAVIRT_XL will be used by TDX, which needs a couple of
paravirt calls that were hidden under CONFIG_PARAVIRT_XXL; the rest
of that config would be bloat for TDX.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/Kconfig                      |  4 +++
 arch/x86/boot/compressed/misc.h       |  1 +
 arch/x86/include/asm/irqflags.h       | 38 +++++++++++++++------------
 arch/x86/include/asm/paravirt.h       | 22 +++++++++-------
 arch/x86/include/asm/paravirt_types.h |  3 ++-
 arch/x86/kernel/paravirt.c            |  4 ++-
 arch/x86/mm/mem_encrypt_identity.c    |  1 +
 7 files changed, 44 insertions(+), 29 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..6b4b682af468 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -783,8 +783,12 @@ config PARAVIRT
 	  over full virtualization.  However, when run without a hypervisor
 	  the kernel is theoretically slower and slightly larger.
 
+config PARAVIRT_XL
+	bool
+
 config PARAVIRT_XXL
 	bool
+	select PARAVIRT_XL
 
 config PARAVIRT_DEBUG
 	bool "paravirt-ops debugging"
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 901ea5ebec22..4b84abe43765 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -9,6 +9,7 @@
  * paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
+#undef CONFIG_PARAVIRT_XL
 #undef CONFIG_PARAVIRT_XXL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 #undef CONFIG_KASAN
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 144d70ea4393..1688841893d7 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -59,27 +59,11 @@ static inline __cpuidle void native_halt(void)
 
 #endif
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_XL
 #include <asm/paravirt.h>
 #else
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
-
-static __always_inline unsigned long arch_local_save_flags(void)
-{
-	return native_save_fl();
-}
-
-static __always_inline void arch_local_irq_disable(void)
-{
-	native_irq_disable();
-}
-
-static __always_inline void arch_local_irq_enable(void)
-{
-	native_irq_enable();
-}
-
 /*
  * Used in the idle loop; sti takes one instruction cycle
  * to complete:
@@ -97,6 +81,26 @@ static inline __cpuidle void halt(void)
 {
 	native_halt();
 }
+#endif /* !__ASSEMBLY__ */
+#endif /* CONFIG_PARAVIRT_XL */
+
+#ifndef CONFIG_PARAVIRT_XXL
+#ifndef __ASSEMBLY__
+
+static __always_inline unsigned long arch_local_save_flags(void)
+{
+	return native_save_fl();
+}
+
+static __always_inline void arch_local_irq_disable(void)
+{
+	native_irq_disable();
+}
+
+static __always_inline void arch_local_irq_enable(void)
+{
+	native_irq_enable();
+}
 
 /*
  * For spinlocks, etc:
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110e2243..2dbb6c9c7e98 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,18 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 	PVOP_VCALL1(mmu.exit_mmap, mm);
 }
 
+#ifdef CONFIG_PARAVIRT_XL
+static inline void arch_safe_halt(void)
+{
+	PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+	PVOP_VCALL0(irq.halt);
+}
+#endif
+
 #ifdef CONFIG_PARAVIRT_XXL
 static inline void load_sp0(unsigned long sp0)
 {
@@ -145,16 +157,6 @@ static inline void __write_cr4(unsigned long x)
 	PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-static inline void arch_safe_halt(void)
-{
-	PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
-	PVOP_VCALL0(irq.halt);
-}
-
 static inline void wbinvd(void)
 {
 	PVOP_VCALL0(cpu.wbinvd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index de87087d3bde..5261fba47ba5 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -177,7 +177,8 @@ struct pv_irq_ops {
 	struct paravirt_callee_save save_fl;
 	struct paravirt_callee_save irq_disable;
 	struct paravirt_callee_save irq_enable;
-
+#endif
+#ifdef CONFIG_PARAVIRT_XL
 	void (*safe_halt)(void);
 	void (*halt)(void);
 #endif
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c60222ab8ab9..d6d0b363fe70 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -322,9 +322,11 @@ struct paravirt_patch_template pv_ops = {
 	.irq.save_fl		= __PV_IS_CALLEE_SAVE(native_save_fl),
 	.irq.irq_disable	= __PV_IS_CALLEE_SAVE(native_irq_disable),
 	.irq.irq_enable		= __PV_IS_CALLEE_SAVE(native_irq_enable),
+#endif /* CONFIG_PARAVIRT_XXL */
+#ifdef CONFIG_PARAVIRT_XL
 	.irq.safe_halt		= native_safe_halt,
 	.irq.halt		= native_halt,
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_XL */
 
 	/* Mmu ops. */
 	.mmu.flush_tlb_user	= native_flush_tlb_local,
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 6c5eb6f3f14f..20d0cb116557 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -24,6 +24,7 @@
  * be extended when new paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
+#undef CONFIG_PARAVIRT_XL
 #undef CONFIG_PARAVIRT_XXL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 
-- 
2.25.1



* [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 21:09   ` Randy Dunlap
  2021-04-26 18:01 ` [RFC v2 03/32] x86/cpufeatures: Add TDX Guest CPU feature Kuppuswamy Sathyanarayanan
                   ` (30 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Add INTEL_TDX_GUEST config option to selectively compile
TDX guest support.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/Kconfig | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6b4b682af468..932e6d759ba7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -875,6 +875,21 @@ config ACRN_GUEST
 	  IOT with small footprint and real-time features. More details can be
 	  found in https://projectacrn.org/.
 
+config INTEL_TDX_GUEST
+	bool "Intel Trusted Domain eXtensions Guest Support"
+	depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
+	depends on SECURITY
+	select PARAVIRT_XL
+	select X86_X2APIC
+	select SECURITY_LOCKDOWN_LSM
+	help
+	  Provide support for running in a trusted domain on Intel processors
+	  equipped with Trusted Domain eXtensions. TDX is a new Intel
+	  technology that extends VMX and Memory Encryption with a new kind of
+	  virtual machine guest called Trust Domain (TD). A TD is designed to
+	  run in a CPU mode that protects the confidentiality of TD memory
+	  contents and the TD’s CPU state from other software, including VMM.
+
 endif #HYPERVISOR_GUEST
 
 source "arch/x86/Kconfig.cpu"
-- 
2.25.1



* [RFC v2 03/32] x86/cpufeatures: Add TDX Guest CPU feature
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 04/32] x86/x86: Add is_tdx_guest() interface Kuppuswamy Sathyanarayanan
                   ` (29 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Add CPU feature detection for Trusted Domain Extensions support.
The TDX feature keeps guest register state and memory isolated from
the hypervisor.

On TDX guest platforms, executing CPUID(0x21, 0) returns the
following values in EAX, EBX, ECX and EDX:

EAX:  Maximum sub-leaf number:  0
EBX/EDX/ECX:  Vendor string:

EBX =  "Inte"
EDX =  "lTDX"
ECX =  "    "

When the above condition is true, set the X86_FEATURE_TDX_GUEST
feature cap bit.

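As a quick illustration of how the 12-byte vendor-string comparison lines up
with the register values above, here is a self-contained user-space sketch
(a hypothetical helper, not the kernel code: the register values are
hard-coded, whereas the kernel reads them via cpuid_count(); this assumes a
little-endian machine, as on x86):

```c
#include <stdint.h>
#include <string.h>

/*
 * Check a CPUID(0x21, 0) signature, with the registers stored in
 * EBX/EDX/ECX order so their bytes concatenate to "IntelTDX    ".
 */
static int tdx_signature_matches(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
	uint32_t signature[3] = { ebx, edx, ecx };

	return memcmp("IntelTDX    ", signature, 12) == 0;
}
```

With EBX = 0x65746e49 ("Inte"), EDX = 0x5844546c ("lTDX") and
ECX = 0x20202020 ("    "), the comparison succeeds.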
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/tdx.h         | 20 ++++++++++++++++++++
 arch/x86/kernel/Makefile           |  1 +
 arch/x86/kernel/head64.c           |  3 +++
 arch/x86/kernel/tdx.c              | 30 ++++++++++++++++++++++++++++++
 5 files changed, 55 insertions(+)
 create mode 100644 arch/x86/include/asm/tdx.h
 create mode 100644 arch/x86/kernel/tdx.c

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index cc96e26d69f7..d883df70c27b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -236,6 +236,7 @@
 #define X86_FEATURE_EPT_AD		( 8*32+17) /* Intel Extended Page Table access-dirty bit */
 #define X86_FEATURE_VMCALL		( 8*32+18) /* "" Hypervisor supports the VMCALL instruction */
 #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
+#define X86_FEATURE_TDX_GUEST		( 8*32+20) /* Trusted Domain Extensions Guest */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
new file mode 100644
index 000000000000..679500e807f3
--- /dev/null
+++ b/arch/x86/include/asm/tdx.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_TDX_H
+#define _ASM_X86_TDX_H
+
+#define TDX_CPUID_LEAF_ID	0x21
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+#include <asm/cpufeature.h>
+
+void __init tdx_early_init(void);
+
+#else // !CONFIG_INTEL_TDX_GUEST
+
+static inline void tdx_early_init(void) { };
+
+#endif /* CONFIG_INTEL_TDX_GUEST */
+
+#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 2ddf08351f0b..ea111bf50691 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,6 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 5e9beb77cafd..75f2401cb5db 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -40,6 +40,7 @@
 #include <asm/extable.h>
 #include <asm/trapnr.h>
 #include <asm/sev-es.h>
+#include <asm/tdx.h>
 
 /*
  * Manage page tables very early on.
@@ -491,6 +492,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	kasan_early_init();
 
+	tdx_early_init();
+
 	idt_setup_early_handler();
 
 	copy_bootdata(__va(real_mode_data));
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
new file mode 100644
index 000000000000..f927e36769d5
--- /dev/null
+++ b/arch/x86/kernel/tdx.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2020 Intel Corporation */
+
+#include <asm/tdx.h>
+
+static inline bool cpuid_has_tdx_guest(void)
+{
+	u32 eax, signature[3];
+
+	if (cpuid_eax(0) < TDX_CPUID_LEAF_ID)
+		return false;
+
+	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &signature[0],
+			&signature[1], &signature[2]);
+
+	if (memcmp("IntelTDX    ", signature, 12))
+		return false;
+
+	return true;
+}
+
+void __init tdx_early_init(void)
+{
+	if (!cpuid_has_tdx_guest())
+		return;
+
+	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+	pr_info("TDX guest is initialized\n");
+}
-- 
2.25.1



* [RFC v2 04/32] x86/x86: Add is_tdx_guest() interface
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (2 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 03/32] x86/cpufeatures: Add TDX Guest CPU feature Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions Kuppuswamy Sathyanarayanan
                   ` (28 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan, Sean Christopherson

Add a helper function to detect TDX support. It will be used to
guard TDX-specific code.
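The intended usage pattern can be sketched in user-space C (everything here
is stubbed for illustration: the flag variable and the example caller are
hypothetical, while the real kernel implementation checks the
X86_FEATURE_TDX_GUEST CPU feature bit):

```c
#include <stdbool.h>

/* Stub for the sketch; the kernel checks X86_FEATURE_TDX_GUEST instead. */
bool tdx_feature_detected;

bool is_tdx_guest(void)
{
	return tdx_feature_detected;
}

/*
 * TDX-specific code paths are then guarded like this
 * (hypothetical example function, not part of the series):
 */
int pick_io_path(void)
{
	if (is_tdx_guest())
		return 1;	/* TDX path, e.g. tdvmcall-based emulation */
	return 0;		/* native path */
}
```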

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile |  1 +
 arch/x86/boot/compressed/tdx.c    | 32 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/tdx.h        |  8 ++++++++
 arch/x86/kernel/tdx.c             |  6 ++++++
 4 files changed, 47 insertions(+)
 create mode 100644 arch/x86/boot/compressed/tdx.c

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index e0bc3988c3fa..a2554621cefe 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -96,6 +96,7 @@ ifdef CONFIG_X86_64
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
new file mode 100644
index 000000000000..0a87c1775b67
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tdx.c - Early boot code for TDX
+ */
+
+#include <asm/tdx.h>
+
+static int __ro_after_init tdx_guest = -1;
+
+static inline bool native_cpuid_has_tdx_guest(void)
+{
+	u32 eax = TDX_CPUID_LEAF_ID, signature[3] = {0};
+
+	if (native_cpuid_eax(0) < TDX_CPUID_LEAF_ID)
+		return false;
+
+	native_cpuid(&eax, &signature[0], &signature[1], &signature[2]);
+
+	if (memcmp("IntelTDX    ", signature, 12))
+		return false;
+
+	return true;
+}
+
+bool is_tdx_guest(void)
+{
+	if (tdx_guest < 0)
+		tdx_guest = native_cpuid_has_tdx_guest();
+
+	return !!tdx_guest;
+}
+
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 679500e807f3..69af72d08d3d 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -9,10 +9,18 @@
 
 #include <asm/cpufeature.h>
 
+/* Common API to check TDX support in decompression and common kernel code. */
+bool is_tdx_guest(void);
+
 void __init tdx_early_init(void);
 
 #else // !CONFIG_INTEL_TDX_GUEST
 
+static inline bool is_tdx_guest(void)
+{
+	return false;
+}
+
 static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index f927e36769d5..6a7193fead08 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -19,6 +19,12 @@ static inline bool cpuid_has_tdx_guest(void)
 	return true;
 }
 
+bool is_tdx_guest(void)
+{
+	return static_cpu_has(X86_FEATURE_TDX_GUEST);
+}
+EXPORT_SYMBOL_GPL(is_tdx_guest);
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
-- 
2.25.1



* [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (3 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 04/32] x86/x86: Add is_tdx_guest() interface Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 20:32   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 06/32] x86/tdx: Get TD execution environment information via TDINFO Kuppuswamy Sathyanarayanan
                   ` (27 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Guests communicate with VMMs using hypercalls. Historically, these
are implemented with instructions that are known to cause VMEXITs,
like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
expose guest state to the host. This prevents the old hypercall
mechanisms from working. To communicate with the VMM, the TDX
specification therefore defines a new instruction called "tdcall".

In a TDX-based VM, since the VMM is an untrusted entity, an
intermediary layer (the TDX module) exists between host and guest to
facilitate secure communication. The guest uses the "tdcall"
instruction to request services from the TDX module, and a variant of
"tdcall" (with specific arguments as defined by the GHCI) to request
services from the VMM via the TDX module.

Implement common helper functions to communicate with the TDX module
and the VMM (using the TDCALL instruction):

__tdvmcall() - request services from the VMM.

__tdcall()   - communicate with the TDX module.

Also define two additional wrappers, tdvmcall() and tdvmcall_out_r11(),
to cover common use cases of __tdvmcall(). Since each use case of
__tdcall() is different, no such wrappers are needed for it.

Implement __tdcall() and __tdvmcall() in assembly. The rationale for
choosing plain assembly over inline assembly:

1. The __tdvmcall() implementation is over 70 lines of instructions
(with comments); implementing it in inline assembly would make it
hard to read.

2. Many registers (R8-R15, R[A-D]X) are used in the TDVMCALL/TDCALL
operation; if all of them were listed as inline assembly constraints,
some older compilers might not be able to meet the requirement.

Also, just like syscalls, not all TDVMCALL/TDCALL use cases need the
same set of argument registers. The implementation here picks the
current worst-case scenario for TDCALL (4 registers). For TDCALLs
with fewer than 4 arguments, there will be a few superfluous (cheap)
instructions. But this approach maximizes code reuse. The same
argument applies to __tdvmcall().

The current implementation of __tdvmcall() includes error handling
(ud2 on failure) in the assembly function instead of in a C wrapper.
The reason for this choice is that when adding support for in/out
instructions (see the patch titled "x86/tdx: Handle port I/O" in this
series), alternative_io() is used to substitute the in/out instruction
with a __tdvmcall() call. Using C wrappers is not trivial in that case
because the input parameters would be in the wrong registers, and it
is tricky to include the proper buffer code to make this work.

For the registers used by the TDCALL instruction, see the TDX GHCI
specification, sections 2.4 and 3:

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
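To illustrate the C-side calling convention of the helpers, here is a
user-space sketch. Everything below is hypothetical scaffolding: __tdcall()
is a stub standing in for the real assembly routine, and the function ID
and argument values are placeholders, not real GHCI/TDX-module leaf
numbers. Only the tdcall_output layout mirrors the struct added by this
patch.

```c
#include <stdint.h>

typedef uint64_t u64;

/* Output registers returned by the TDX module, as in the patch below. */
struct tdcall_output {
	u64 rcx, rdx, r8, r9, r10, r11;
};

/*
 * Stub standing in for the assembly helper: returns a status code in
 * RAX (0 == success) and fills @out with the output registers. Here it
 * just echoes the RCX input, purely so the sketch is executable.
 */
static u64 __tdcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
		    struct tdcall_output *out)
{
	if (out)
		out->rcx = rcx;	/* placeholder behavior for the sketch */
	return 0;
}

/* A caller checks the return status, then reads the output struct: */
static int query_module(u64 *val)
{
	struct tdcall_output out = { 0 };

	if (__tdcall(1 /* hypothetical leaf */, 42, 0, 0, 0, &out))
		return -1;

	*val = out.rcx;
	return 0;
}
```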

Originally-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/tdx.h    |  26 +++++
 arch/x86/kernel/Makefile      |   2 +-
 arch/x86/kernel/asm-offsets.c |  22 ++++
 arch/x86/kernel/tdcall.S      | 200 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c         |  36 ++++++
 5 files changed, 285 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..6c3c71bb57a0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,38 @@
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
+#include <linux/types.h>
+
+struct tdcall_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+struct tdvmcall_output {
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
 void __init tdx_early_init(void);
 
+/* Helper function used to communicate with the TDX module */
+u64 __tdcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+	     struct tdcall_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+	       struct tdvmcall_output *out);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..4a9885a9a28b 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
 #include <xen/interface/xen.h>
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
 #ifdef CONFIG_X86_32
 # include "asm-offsets_32.c"
 #else
@@ -75,6 +79,24 @@ static void __used common(void)
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+	BLANK();
+	/* Offset for fields in tdcall_output */
+	OFFSET(TDCALL_rcx, tdcall_output, rcx);
+	OFFSET(TDCALL_rdx, tdcall_output, rdx);
+	OFFSET(TDCALL_r8,  tdcall_output, r8);
+	OFFSET(TDCALL_r9,  tdcall_output, r9);
+	OFFSET(TDCALL_r10, tdcall_output, r10);
+	OFFSET(TDCALL_r11, tdcall_output, r11);
+
+	/* Offset for fields in tdvmcall_output */
+	OFFSET(TDVMCALL_r11, tdvmcall_output, r11);
+	OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
+	OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
+	OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
+	OFFSET(TDVMCALL_r15, tdvmcall_output, r15);
+#endif
+
 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
 	OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..81af70c2acbd
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,200 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+
+/*
+ * Expose registers R10-R15 to VMM (for bitfield info
+ * refer to TDX GHCI specification).
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK	0xfc00
+
+/*
+ * TDX guests use the TDCALL instruction to make
+ * hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdcall()  - Used to communicate with the TDX module
+ *
+ * @arg1 (RDI) - TDCALL Leaf ID
+ * @arg2 (RSI) - Input parameter 1 passed to TDX module
+ *               via register RCX
+ * @arg3 (RDX) - Input parameter 2 passed to TDX module
+ *               via register RDX
+ * @arg4 (RCX) - Input parameter 3 passed to TDX module
+ *               via register R8
+ * @arg5 (R8)  - Input parameter 4 passed to TDX module
+ *               via register R9
+ * @arg6 (R9)  - struct tdcall_output pointer
+ *
+ * @out        - Return status of tdcall via RAX.
+ *
+ * NOTE: This function should only be used for non-TDVMCALL
+ *       use cases.
+ */
+SYM_FUNC_START(__tdcall)
+	FRAME_BEGIN
+
+	/* Save non-volatile GPRs that are exposed to the VMM. */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/* Move TDCALL Leaf ID to RAX */
+	mov %rdi, %rax
+	/* Move output pointer to R12 */
+	mov %r9, %r12
+	/* Move input param 4 to R9 */
+	mov %r8, %r9
+	/* Move input param 3 to R8 */
+	mov %rcx, %r8
+	/* Leave input param 2 in RDX */
+	/* Move input param 1 to RCX */
+	mov %rsi, %rcx
+
+	tdcall
+
+	/* Check for TDCALL success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check for a TDCALL output struct */
+	test %r12, %r12
+	jz 1f
+
+	/* Copy TDCALL result registers to output struct: */
+	movq %rcx, TDCALL_rcx(%r12)
+	movq %rdx, TDCALL_rdx(%r12)
+	movq %r8,  TDCALL_r8(%r12)
+	movq %r9,  TDCALL_r9(%r12)
+	movq %r10, TDCALL_r10(%r12)
+	movq %r11, TDCALL_r11(%r12)
+1:
+	/* Zero out registers exposed to the TDX Module. */
+	xor %rcx,  %rcx
+	xor %rdx,  %rdx
+	xor %r8d,  %r8d
+	xor %r9d,  %r9d
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+
+	/* Restore non-volatile GPRs that are exposed to the VMM. */
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	FRAME_END
+	ret
+SYM_FUNC_END(__tdcall)
+
+/*
+ * do_tdvmcall()  - Used to communicate with the VMM.
+ *
+ * @arg1 (RDI)    - TDVMCALL function, e.g. exit reason
+ * @arg2 (RSI)    - Input parameter 1 passed to VMM
+ *                  via register R12
+ * @arg3 (RDX)    - Input parameter 2 passed to VMM
+ *                  via register R13
+ * @arg4 (RCX)    - Input parameter 3 passed to VMM
+ *                  via register R14
+ * @arg5 (R8)     - Input parameter 4 passed to VMM
+ *                  via register R15
+ * @arg6 (R9)     - struct tdvmcall_output pointer
+ *
+ * @out           - Return status of tdvmcall(R10) via RAX.
+ *
+ */
+SYM_CODE_START_LOCAL(do_tdvmcall)
+	FRAME_BEGIN
+
+	/* Save non-volatile GPRs that are exposed to the VMM. */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/* Set TDCALL leaf ID to TDVMCALL (0) in RAX */
+	xor %eax, %eax
+	/* Move TDVMCALL function id (1st argument) to R11 */
+	mov %rdi, %r11
+	/* Move Input parameter 1-4 to R12-R15 */
+	mov %rsi, %r12
+	mov %rdx, %r13
+	mov %rcx, %r14
+	mov %r8,  %r15
+	/* Leave tdvmcall output pointer in R9 */
+
+	/*
+	 * Value of RCX is used by the TDX Module to determine which
+	 * registers are exposed to VMM. Each bit in RCX represents a
+	 * register id. You can find the bitmap details from TDX GHCI
+	 * spec.
+	 */
+	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+	tdcall
+
+	/*
+	 * Check for TDCALL success: 0 - Successful, otherwise failed.
+	 * If failed, there is an issue with TDX Module which is fatal
+	 * for the guest. So panic.
+	 */
+	test %rax, %rax
+	jnz 2f
+
+	/* Move TDVMCALL success/failure to RAX to return to user */
+	mov %r10, %rax
+
+	/* Check for TDVMCALL success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check for a TDVMCALL output struct */
+	test %r9, %r9
+	jz 1f
+
+	/* Copy TDVMCALL result registers to output struct: */
+	movq %r11, TDVMCALL_r11(%r9)
+	movq %r12, TDVMCALL_r12(%r9)
+	movq %r13, TDVMCALL_r13(%r9)
+	movq %r14, TDVMCALL_r14(%r9)
+	movq %r15, TDVMCALL_r15(%r9)
+1:
+	/*
+	 * Zero out registers exposed to the VMM to avoid
+	 * speculative execution with VMM-controlled values.
+	 */
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+	xor %r12d, %r12d
+	xor %r13d, %r13d
+	xor %r14d, %r14d
+	xor %r15d, %r15d
+
+	/* Restore non-volatile GPRs that are exposed to the VMM. */
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	FRAME_END
+	ret
+2:
+	ud2
+SYM_CODE_END(do_tdvmcall)
+
+/* Helper function for standard type of TDVMCALL */
+SYM_FUNC_START(__tdvmcall)
+	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+	xor %r10, %r10
+	call do_tdvmcall
+	retq
+SYM_FUNC_END(__tdvmcall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 6a7193fead08..29c52128b9c0 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -1,8 +1,44 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (C) 2020 Intel Corporation */
 
+#define pr_fmt(fmt) "TDX: " fmt
+
 #include <asm/tdx.h>
 
+/*
+ * Wrapper for the use case that checks the error code and prints a warning message.
+ */
+static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	u64 err;
+
+	err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return err;
+}
+
+/*
+ * Wrapper for the semi-common case where we need a single output value (R11).
+ */
+static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+
+	struct tdvmcall_output out = {0};
+	u64 err;
+
+	err = __tdvmcall(fn, r12, r13, r14, r15, &out);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return out.r11;
+}
+
 static inline bool cpuid_has_tdx_guest(void)
 {
 	u32 eax, signature[3];
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 06/32] x86/tdx: Get TD execution environment information via TDINFO
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (4 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 07/32] x86/traps: Add do_general_protection() helper function Kuppuswamy Sathyanarayanan
                   ` (26 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Per Guest-Host-Communication Interface (GHCI) for Intel Trust
Domain Extensions (Intel TDX) specification, sec 2.4.2,
TDCALL[TDINFO] provides basic TD execution environment information, not
provided by CPUID.

Call TDINFO during early boot so that the information is available
for subsequent system initialization.

The call reports which bit in the guest physical address (pfn) is
used to indicate that a page is shared with the host, as well as TD
attributes such as debug mode.

Information about the number of CPUs is not saved, as there are no
users for it so far.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/tdx.h |  2 ++
 arch/x86/kernel/tdx.c      | 23 +++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 6c3c71bb57a0..c5a870cef0ae 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -10,6 +10,8 @@
 #include <asm/cpufeature.h>
 #include <linux/types.h>
 
+#define TDINFO			1
+
 struct tdcall_output {
 	u64 rcx;
 	u64 rdx;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 29c52128b9c0..b63275db1db9 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -4,6 +4,14 @@
 #define pr_fmt(fmt) "TDX: " fmt
 
 #include <asm/tdx.h>
+#include <asm/vmx.h>
+
+#include <linux/cpu.h>
+
+static struct {
+	unsigned int gpa_width;
+	unsigned long attributes;
+} td_info __ro_after_init;
 
 /*
  * Wrapper for the use case that checks the error code and prints a warning message.
@@ -61,6 +69,19 @@ bool is_tdx_guest(void)
 }
 EXPORT_SYMBOL_GPL(is_tdx_guest);
 
+static void tdg_get_info(void)
+{
+	u64 ret;
+	struct tdcall_output out = {0};
+
+	ret = __tdcall(TDINFO, 0, 0, 0, 0, &out);
+
+	BUG_ON(ret);
+
+	td_info.gpa_width = out.rcx & GENMASK(5, 0);
+	td_info.attributes = out.rdx;
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
@@ -68,5 +89,7 @@ void __init tdx_early_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+	tdg_get_info();
+
 	pr_info("TDX guest is initialized\n");
 }
-- 
2.25.1



* [RFC v2 07/32] x86/traps: Add do_general_protection() helper function
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (5 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 06/32] x86/tdx: Get TD execution environment information via TDINFO Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 21:20   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 08/32] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
                   ` (25 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

The TDX guest #VE exception handler treats unsupported exceptions
as #GP. To allow reusing the #GP handling code, move it out of
exc_general_protection() and into a new do_general_protection()
helper function.

Also, since the exception entry point is responsible for deciding
when to enable/disable IRQs, move the cond_local_irq_{enable,disable}()
calls out of do_general_protection().

This is a preparatory patch for adding #VE exception handler
support for TDX guests.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/traps.c | 51 ++++++++++++++++++++++-------------------
 1 file changed, 27 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 651e3e508959..213d4aa8e337 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -527,44 +527,28 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
 
 #define GPFSTR "general protection fault"
 
-DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
+static void do_general_protection(struct pt_regs *regs, long error_code)
 {
 	char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
 	enum kernel_gp_hint hint = GP_NO_HINT;
-	struct task_struct *tsk;
+	struct task_struct *tsk = current;
 	unsigned long gp_addr;
 	int ret;
 
-	cond_local_irq_enable(regs);
-
-	if (static_cpu_has(X86_FEATURE_UMIP)) {
-		if (user_mode(regs) && fixup_umip_exception(regs))
-			goto exit;
-	}
-
-	if (v8086_mode(regs)) {
-		local_irq_enable();
-		handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
-		local_irq_disable();
-		return;
-	}
-
-	tsk = current;
-
 	if (user_mode(regs)) {
 		tsk->thread.error_code = error_code;
 		tsk->thread.trap_nr = X86_TRAP_GP;
 
 		if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
-			goto exit;
+			return;
 
 		show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
 		force_sig(SIGSEGV);
-		goto exit;
+		return;
 	}
 
 	if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
-		goto exit;
+		return;
 
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_nr = X86_TRAP_GP;
@@ -576,11 +560,11 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	if (!preemptible() &&
 	    kprobe_running() &&
 	    kprobe_fault_handler(regs, X86_TRAP_GP))
-		goto exit;
+		return;
 
 	ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);
 	if (ret == NOTIFY_STOP)
-		goto exit;
+		return;
 
 	if (error_code)
 		snprintf(desc, sizeof(desc), "segment-related " GPFSTR);
@@ -601,8 +585,27 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 		gp_addr = 0;
 
 	die_addr(desc, regs, error_code, gp_addr);
+}
 
-exit:
+DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
+{
+	cond_local_irq_enable(regs);
+
+	if (static_cpu_has(X86_FEATURE_UMIP)) {
+		if (user_mode(regs) && fixup_umip_exception(regs)) {
+			cond_local_irq_disable(regs);
+			return;
+		}
+	}
+
+	if (v8086_mode(regs)) {
+		local_irq_enable();
+		handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
+		local_irq_disable();
+		return;
+	}
+
+	do_general_protection(regs, error_code);
 	cond_local_irq_disable(regs);
 }
 
-- 
2.25.1



* [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (6 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 07/32] x86/traps: Add do_general_protection() helper function Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 21:36   ` Dave Hansen
  2021-06-08 17:02   ` [RFC v2 08/32] " Dave Hansen
  2021-04-26 18:01 ` [RFC v2 09/32] x86/tdx: Add HLT " Kuppuswamy Sathyanarayanan
                   ` (24 subsequent siblings)
  32 siblings, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The TDX module injects a #VE exception into the guest TD in cases of
disallowed instructions, disallowed MSR accesses and a subset of CPUID
leaves. The TDX module guarantees that no #VE is injected on an EPT
violation on guest physical addresses that are regular memory; a #VE
can still be delivered on MMIO mappings. This avoids any problems with
the "system call gap".

Add basic infrastructure to handle #VE. If there is no handler for a
given #VE, it is an unexpected event (fault case), so treat it as a
general protection fault and handle it via do_general_protection().

TDCALL[TDGETVEINFO] provides information about the #VE, such as the
exit reason.

A #VE cannot be nested before TDGETVEINFO is called; if it were to
nest for any reason, the TD would shut down. The TDX module guarantees
that no NMIs (or #MC or similar) can happen in this window. After
TDGETVEINFO the #VE handler can nest if needed, although this is not
expected to happen normally.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/idtentry.h |  4 ++++
 arch/x86/include/asm/tdx.h      | 15 +++++++++++++
 arch/x86/kernel/idt.c           |  6 ++++++
 arch/x86/kernel/tdx.c           | 38 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps.c         | 30 ++++++++++++++++++++++++++
 5 files changed, 93 insertions(+)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5eb3bdf36a41..41a0732d5f68 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -619,6 +619,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER,	exc_xen_hypervisor_callback);
 DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER,	exc_xen_unknown_trap);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE,		exc_virtualization_exception);
+#endif
+
 /* Device interrupts common/spurious */
 DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index c5a870cef0ae..1ca55d8e9963 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -11,6 +11,7 @@
 #include <linux/types.h>
 
 #define TDINFO			1
+#define TDGETVEINFO		3
 
 struct tdcall_output {
 	u64 rcx;
@@ -29,6 +30,20 @@ struct tdvmcall_output {
 	u64 r15;
 };
 
+struct ve_info {
+	u64 exit_reason;
+	u64 exit_qual;
+	u64 gla;
+	u64 gpa;
+	u32 instr_len;
+	u32 instr_info;
+};
+
+unsigned long tdg_get_ve_info(struct ve_info *ve);
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+		struct ve_info *ve);
+
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..546b6b636c7d 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
 	 */
 	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 #endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 };
 
 /*
@@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
 	INTG(X86_TRAP_MF,		asm_exc_coprocessor_error),
 	INTG(X86_TRAP_AC,		asm_exc_alignment_check),
 	INTG(X86_TRAP_XF,		asm_exc_simd_coprocessor_error),
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 
 #ifdef CONFIG_X86_32
 	TSKG(X86_TRAP_DF,		GDT_ENTRY_DOUBLEFAULT_TSS),
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index b63275db1db9..ccfcb07bfb2c 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -82,6 +82,44 @@ static void tdg_get_info(void)
 	td_info.attributes = out.rdx;
 }
 
+unsigned long tdg_get_ve_info(struct ve_info *ve)
+{
+	u64 ret;
+	struct tdcall_output out = {0};
+
+	/*
+	 * The #VE cannot be nested before TDGETVEINFO is called,
+	 * if there is any reason for it to nest the TD would shut
+	 * down. The TDX module guarantees that no NMIs (or #MC or
+	 * similar) can happen in this window. After TDGETVEINFO
+	 * the #VE handler can nest if needed, although we don't
+	 * expect it to happen normally.
+	 */
+
+	ret = __tdcall(TDGETVEINFO, 0, 0, 0, 0, &out);
+
+	ve->exit_reason = out.rcx;
+	ve->exit_qual   = out.rdx;
+	ve->gla         = out.r8;
+	ve->gpa         = out.r9;
+	ve->instr_len   = out.r10 & UINT_MAX;
+	ve->instr_info  = out.r10 >> 32;
+
+	return ret;
+}
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+		struct ve_info *ve)
+{
+	/*
+	 * TODO: Add handler support for various #VE exit
+	 * reasons. It will be added by other patches in
+	 * the series.
+	 */
+	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+	return -EFAULT;
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 213d4aa8e337..64869aa88a5a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/vdso.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
@@ -1140,6 +1141,35 @@ DEFINE_IDTENTRY(exc_device_not_available)
 	}
 }
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+	struct ve_info ve;
+	int ret;
+
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
+	/*
+	 * Consume #VE info before re-enabling interrupts. It will be
+	 * re-enabled after executing the TDGETVEINFO TDCALL.
+	 */
+	ret = tdg_get_ve_info(&ve);
+
+	cond_local_irq_enable(regs);
+
+	if (!ret)
+		ret = tdg_handle_virtualization_exception(regs, &ve);
+	/*
+	 * If tdg_handle_virtualization_exception() could not process
+	 * it successfully, treat it as #GP(0) and handle it.
+	 */
+	if (ret)
+		do_general_protection(regs, 0);
+
+	cond_local_irq_disable(regs);
+}
+#endif
+
 #ifdef CONFIG_X86_32
 DEFINE_IDTENTRY_SW(iret_error)
 {
-- 
2.25.1



* [RFC v2 09/32] x86/tdx: Add HLT support for TDX guest
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (7 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 08/32] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls Kuppuswamy Sathyanarayanan
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Per Guest-Host-Communication Interface (GHCI) for Intel Trust
Domain Extensions (Intel TDX) specification, sec 3.8,
TDVMCALL[Instruction.HLT] provides the HLT operation. Use it to
implement halt() and safe_halt() paravirtualization calls.

The same TDVMCALL is used to handle the #VE exception raised for
EXIT_REASON_HLT.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 44 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 37 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ccfcb07bfb2c..5169f72b6b3f 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -82,6 +82,27 @@ static void tdg_get_info(void)
 	td_info.attributes = out.rdx;
 }
 
+static __cpuidle void tdg_halt(void)
+{
+	u64 ret;
+
+	ret = __tdvmcall(EXIT_REASON_HLT, 0, 0, 0, 0, NULL);
+
+	/* It should never fail */
+	BUG_ON(ret);
+}
+
+static __cpuidle void tdg_safe_halt(void)
+{
+	/*
+	 * Enable interrupts next to the TDVMCALL to avoid
+	 * performance degradation.
+	 */
+	asm volatile("sti\n\t");
+
+	tdg_halt();
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -111,13 +132,19 @@ unsigned long tdg_get_ve_info(struct ve_info *ve)
 int tdg_handle_virtualization_exception(struct pt_regs *regs,
 		struct ve_info *ve)
 {
-	/*
-	 * TODO: Add handler support for various #VE exit
-	 * reasons. It will be added by other patches in
-	 * the series.
-	 */
-	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
-	return -EFAULT;
+	switch (ve->exit_reason) {
+	case EXIT_REASON_HLT:
+		tdg_halt();
+		break;
+	default:
+		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+		return -EFAULT;
+	}
+
+	/* After successful #VE handling, move the IP */
+	regs->ip += ve->instr_len;
+
+	return 0;
 }
 
 void __init tdx_early_init(void)
@@ -129,5 +156,8 @@ void __init tdx_early_init(void)
 
 	tdg_get_info();
 
+	pv_ops.irq.safe_halt = tdg_safe_halt;
+	pv_ops.irq.halt = tdg_halt;
+
 	pr_info("TDX guest is initialized\n");
 }
-- 
2.25.1



* [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (8 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 09/32] x86/tdx: Add HLT " Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 21:46   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 11/32] x86/tdx: Add MSR support for TDX guest Kuppuswamy Sathyanarayanan
                   ` (22 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

KVM hypercalls have to be wrapped into vendor-specific TDVMCALLs.

[Isaku: proposed KVM VENDOR string]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/kvm_para.h | 21 +++++++++++++++
 arch/x86/include/asm/tdx.h      | 39 ++++++++++++++++++++++++++++
 arch/x86/kernel/tdcall.S        |  7 +++++
 arch/x86/kernel/tdx-kvm.c       | 45 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c           |  4 +++
 5 files changed, 116 insertions(+)
 create mode 100644 arch/x86/kernel/tdx-kvm.c

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
 #include <asm/alternative.h>
 #include <linux/interrupt.h>
 #include <uapi/asm/kvm_para.h>
+#include <asm/tdx.h>
 
 extern void kvmclock_init(void);
 
@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 static inline long kvm_hypercall0(unsigned int nr)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall0(nr);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall1(nr, p1);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
 				  unsigned long p2)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall2(nr, p1, p2);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
 				  unsigned long p2, unsigned long p3)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 				  unsigned long p4)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 1ca55d8e9963..e0b3ed9e262c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -56,6 +56,16 @@ u64 __tdcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 /* Helper function used to request services from VMM */
 u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
 	       struct tdvmcall_output *out);
+u64 __tdvmcall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+			  struct tdvmcall_output *out);
+
+long tdx_kvm_hypercall0(unsigned int nr);
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1);
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2);
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3);
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4);
 
 #else // !CONFIG_INTEL_TDX_GUEST
 
@@ -66,6 +76,35 @@ static inline bool is_tdx_guest(void)
 
 static inline void tdx_early_init(void) { };
 
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+				      unsigned long p2)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3,
+				      unsigned long p4)
+{
+	return -ENODEV;
+}
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index 81af70c2acbd..964bfd7fc682 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -11,6 +11,7 @@
  * refer to TDX GHCI specification).
  */
 #define TDVMCALL_EXPOSE_REGS_MASK	0xfc00
+#define TDVMCALL_VENDOR_KVM		0x4d564b2e584454 /* "TDX.KVM" */
 
 /*
  * TDX guests use the TDCALL instruction to make
@@ -198,3 +199,9 @@ SYM_FUNC_START(__tdvmcall)
 	call do_tdvmcall
 	retq
 SYM_FUNC_END(__tdvmcall)
+
+SYM_FUNC_START(__tdvmcall_vendor_kvm)
+	movq $TDVMCALL_VENDOR_KVM, %r10
+	call do_tdvmcall
+	retq
+SYM_FUNC_END(__tdvmcall_vendor_kvm)
diff --git a/arch/x86/kernel/tdx-kvm.c b/arch/x86/kernel/tdx-kvm.c
new file mode 100644
index 000000000000..c4264e926712
--- /dev/null
+++ b/arch/x86/kernel/tdx-kvm.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+
+static long tdvmcall_vendor(unsigned int fn, unsigned long r12,
+			    unsigned long r13, unsigned long r14,
+			    unsigned long r15)
+{
+	return __tdvmcall_vendor_kvm(fn, r12, r13, r14, r15, NULL);
+}
+
+/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return tdvmcall_vendor(nr, 0, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall0);
+
+/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return tdvmcall_vendor(nr, p1, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall1);
+
+/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2)
+{
+	return tdvmcall_vendor(nr, p1, p2, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall2);
+
+/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3)
+{
+	return tdvmcall_vendor(nr, p1, p2, p3, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall3);
+
+/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4)
+{
+	return tdvmcall_vendor(nr, p1, p2, p3, p4);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 5169f72b6b3f..721c213d807d 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -8,6 +8,10 @@
 
 #include <linux/cpu.h>
 
+#ifdef CONFIG_KVM_GUEST
+#include "tdx-kvm.c"
+#endif
+
 static struct {
 	unsigned int gpa_width;
 	unsigned long attributes;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 11/32] x86/tdx: Add MSR support for TDX guest
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (9 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 12/32] x86/tdx: Handle CPUID via #VE Kuppuswamy Sathyanarayanan
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Operations on context-switched MSRs can be run natively. The rest of
the MSRs should be handled through TDVMCALLs.

TDVMCALL[Instruction.RDMSR] and TDVMCALL[Instruction.WRMSR] provide
MSR operations.

You can find RDMSR and WRMSR details in Guest-Host-Communication
Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX)
specification, sec 3.10, 3.11.

Also, since the CSTAR MSR is not used by Intel CPUs for the SYSCALL
instruction, ignore accesses to it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 85 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 83 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 721c213d807d..5b16707b3577 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -107,6 +107,73 @@ static __cpuidle void tdg_safe_halt(void)
 	tdg_halt();
 }
 
+static bool tdg_is_context_switched_msr(unsigned int msr)
+{
+	/*  XXX: Update the list of context-switched MSRs */
+
+	switch (msr) {
+	case MSR_EFER:
+	case MSR_IA32_CR_PAT:
+	case MSR_FS_BASE:
+	case MSR_GS_BASE:
+	case MSR_KERNEL_GS_BASE:
+	case MSR_IA32_SYSENTER_CS:
+	case MSR_IA32_SYSENTER_EIP:
+	case MSR_IA32_SYSENTER_ESP:
+	case MSR_STAR:
+	case MSR_LSTAR:
+	case MSR_SYSCALL_MASK:
+	case MSR_IA32_XSS:
+	case MSR_TSC_AUX:
+	case MSR_IA32_BNDCFGS:
+		return true;
+	}
+	return false;
+}
+
+static u64 tdg_read_msr_safe(unsigned int msr, int *err)
+{
+	u64 ret;
+	struct tdvmcall_output out = {0};
+
+	WARN_ON_ONCE(tdg_is_context_switched_msr(msr));
+
+	/*
+	 * The CSTAR MSR is not used by Intel CPUs for the SYSCALL
+	 * instruction, so just ignore it. Even raising a TDVMCALL
+	 * would lead to the same result.
+	 */
+	if (msr == MSR_CSTAR)
+		return 0;
+
+	ret = __tdvmcall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out);
+
+	*err = (ret) ? -EIO : 0;
+
+	return out.r11;
+}
+
+static int tdg_write_msr_safe(unsigned int msr, unsigned int low,
+			      unsigned int high)
+{
+	u64 ret;
+
+	WARN_ON_ONCE(tdg_is_context_switched_msr(msr));
+
+	/*
+	 * The CSTAR MSR is not used by Intel CPUs for the SYSCALL
+	 * instruction, so just ignore it. Even raising a TDVMCALL
+	 * would lead to the same result.
+	 */
+	if (msr == MSR_CSTAR)
+		return 0;
+
+	ret = __tdvmcall(EXIT_REASON_MSR_WRITE, msr, (u64)high << 32 | low,
+			 0, 0, NULL);
+
+	return ret ? -EIO : 0;
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -136,19 +203,33 @@ unsigned long tdg_get_ve_info(struct ve_info *ve)
 int tdg_handle_virtualization_exception(struct pt_regs *regs,
 		struct ve_info *ve)
 {
+	unsigned long val;
+	int ret = 0;
+
 	switch (ve->exit_reason) {
 	case EXIT_REASON_HLT:
 		tdg_halt();
 		break;
+	case EXIT_REASON_MSR_READ:
+		val = tdg_read_msr_safe(regs->cx, &ret);
+		if (!ret) {
+			regs->ax = val & UINT_MAX;
+			regs->dx = val >> 32;
+		}
+		break;
+	case EXIT_REASON_MSR_WRITE:
+		ret = tdg_write_msr_safe(regs->cx, regs->ax, regs->dx);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
 	}
 
 	/* After successful #VE handling, move the IP */
-	regs->ip += ve->instr_len;
+	if (!ret)
+		regs->ip += ve->instr_len;
 
-	return 0;
+	return ret;
 }
 
 void __init tdx_early_init(void)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 12/32] x86/tdx: Handle CPUID via #VE
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (10 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 11/32] x86/tdx: Add MSR support for TDX guest Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 13/32] x86/io: Allow to override inX() and outX() implementation Kuppuswamy Sathyanarayanan
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

TDX has three classes of CPUID leaves: some CPUID leaves
are always handled by the CPU, others are handled by the TDX module,
and some others are handled by the VMM. Since the VMM cannot directly
intercept the instruction, those leaves are reflected with a #VE
exception to the guest, which then either converts them into TDCALLs
to the VMM or handles them directly.

The TDX module EAS has a full list of CPUID leaves which are handled
natively or by the TDX module, in sec 16.2. Only unknown CPUID leaves
are handled by the #VE method. In practice this typically only applies
to the hypervisor-specific CPUID leaves unknown to the native CPU.

Therefore there is no risk of triggering this in early CPUID code,
which runs before the #VE handler is set up, because it will never
access those exotic CPUID leaves.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 5b16707b3577..e42e260df245 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -174,6 +174,21 @@ static int tdg_write_msr_safe(unsigned int msr, unsigned int low,
 	return ret ? -EIO : 0;
 }
 
+static void tdg_handle_cpuid(struct pt_regs *regs)
+{
+	u64 ret;
+	struct tdvmcall_output out = {0};
+
+	ret = __tdvmcall(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, &out);
+
+	WARN_ON(ret);
+
+	regs->ax = out.r12;
+	regs->bx = out.r13;
+	regs->cx = out.r14;
+	regs->dx = out.r15;
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -220,6 +235,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_MSR_WRITE:
 		ret = tdg_write_msr_safe(regs->cx, regs->ax, regs->dx);
 		break;
+	case EXIT_REASON_CPUID:
+		tdg_handle_cpuid(regs);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 13/32] x86/io: Allow to override inX() and outX() implementation
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (11 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 12/32] x86/tdx: Handle CPUID via #VE Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 14/32] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Allow overriding the implementation of the port I/O helpers. TDX code
will provide an implementation that redirects the helpers to paravirt
calls.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/io.h | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d726459d08e5..ef7a686a55a9 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -271,18 +271,26 @@ static inline bool sev_key_active(void) { return false; }
 
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 
+#ifndef __out
+#define __out(bwl, bw)							\
+	asm volatile("out" #bwl " %" #bw "0, %w1" : : "a"(value), "Nd"(port))
+#endif
+
+#ifndef __in
+#define __in(bwl, bw)							\
+	asm volatile("in" #bwl " %w1, %" #bw "0" : "=a"(value) : "Nd"(port))
+#endif
+
 #define BUILDIO(bwl, bw, type)						\
 static inline void out##bwl(unsigned type value, int port)		\
 {									\
-	asm volatile("out" #bwl " %" #bw "0, %w1"			\
-		     : : "a"(value), "Nd"(port));			\
+	__out(bwl, bw);							\
 }									\
 									\
 static inline unsigned type in##bwl(int port)				\
 {									\
 	unsigned type value;						\
-	asm volatile("in" #bwl " %w1, %" #bw "0"			\
-		     : "=a"(value) : "Nd"(port));			\
+	__in(bwl, bw);							\
 	return value;							\
 }									\
 									\
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (12 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 13/32] x86/io: Allow to override inX() and outX() implementation Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-10 21:57   ` Dan Williams
  2021-04-26 18:01 ` [RFC v2 15/32] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
                   ` (18 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Unroll string operations and handle port I/O through TDVMCALLs.
Also handle #VE due to I/O operations with the same TDVMCALLs.

Decompression code uses port I/O for earlyprintk. We must use
paravirt calls there too if we want to allow earlyprintk.

The decompression code cannot deal with alternatives: use branches
instead to implement the inX() and outX() helpers.

Since a call instruction is used in place of the in/out instruction,
the argument passed to it has to be in a register; it cannot be an
immediate value as with in/out. So change the constraint flag from
"Nd" to "d".

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile |   1 +
 arch/x86/boot/compressed/tdcall.S |   9 ++
 arch/x86/include/asm/io.h         |   5 +-
 arch/x86/include/asm/tdx.h        |  46 ++++++++-
 arch/x86/kernel/tdcall.S          | 154 ++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c             |  33 +++++++
 6 files changed, 245 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdcall.S

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index a2554621cefe..a944a2038797 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -97,6 +97,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
new file mode 100644
index 000000000000..5ebb80d45ad8
--- /dev/null
+++ b/arch/x86/boot/compressed/tdcall.S
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <asm/export.h>
+
+/* Do not export symbols in decompression code */
+#undef EXPORT_SYMBOL
+#define EXPORT_SYMBOL(sym)
+
+#include "../../kernel/tdcall.S"
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index ef7a686a55a9..30a3b30395ad 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -43,6 +43,7 @@
 #include <asm/page.h>
 #include <asm/early_ioremap.h>
 #include <asm/pgtable_types.h>
+#include <asm/tdx.h>
 
 #define build_mmio_read(name, size, type, reg, barrier) \
 static inline type name(const volatile void __iomem *addr) \
@@ -309,7 +310,7 @@ static inline unsigned type in##bwl##_p(int port)			\
 									\
 static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 {									\
-	if (sev_key_active()) {						\
+	if (sev_key_active() || is_tdx_guest()) {			\
 		unsigned type *value = (unsigned type *)addr;		\
 		while (count) {						\
 			out##bwl(*value, port);				\
@@ -325,7 +326,7 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 									\
 static inline void ins##bwl(int port, void *addr, unsigned long count)	\
 {									\
-	if (sev_key_active()) {						\
+	if (sev_key_active() || is_tdx_guest()) {			\
 		unsigned type *value = (unsigned type *)addr;		\
 		while (count) {						\
 			*value = in##bwl(port);				\
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e0b3ed9e262c..b972c6531a53 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -5,6 +5,8 @@
 
 #define TDX_CPUID_LEAF_ID	0x21
 
+#ifndef __ASSEMBLY__
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
@@ -67,6 +69,48 @@ long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
 long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
 		unsigned long p3, unsigned long p4);
 
+/* Decompression code doesn't know how to handle alternatives */
+#ifdef BOOT_COMPRESSED_MISC_H
+#define __out(bwl, bw)							\
+do {									\
+	if (is_tdx_guest()) {						\
+		asm volatile("call tdg_out" #bwl : :			\
+				"a"(value), "d"(port));			\
+	} else {							\
+		asm volatile("out" #bwl " %" #bw "0, %w1" : :		\
+				"a"(value), "Nd"(port));		\
+	}								\
+} while (0)
+#define __in(bwl, bw)							\
+do {									\
+	if (is_tdx_guest()) {						\
+		asm volatile("call tdg_in" #bwl :			\
+				"=a"(value) : "d"(port));		\
+	} else {							\
+		asm volatile("in" #bwl " %w1, %" #bw "0" :		\
+				"=a"(value) : "Nd"(port));		\
+	}								\
+} while (0)
+#else
+#define __out(bwl, bw)							\
+	alternative_input("out" #bwl " %" #bw "1, %w2",			\
+			"call tdg_out" #bwl, X86_FEATURE_TDX_GUEST,	\
+			"a"(value), "d"(port))
+
+#define __in(bwl, bw)							\
+	alternative_io("in" #bwl " %w2, %" #bw "0",			\
+			"call tdg_in" #bwl, X86_FEATURE_TDX_GUEST,	\
+			"=a"(value), "d"(port))
+#endif
+
+void tdg_outb(unsigned char value, unsigned short port);
+void tdg_outw(unsigned short value, unsigned short port);
+void tdg_outl(unsigned int value, unsigned short port);
+
+unsigned char tdg_inb(unsigned short port);
+unsigned short tdg_inw(unsigned short port);
+unsigned int tdg_inl(unsigned short port);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
@@ -106,5 +150,5 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
 }
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
-
+#endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index 964bfd7fc682..df4159bb5103 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -3,6 +3,7 @@
 #include <asm/asm.h>
 #include <asm/frame.h>
 #include <asm/unwind_hints.h>
+#include <asm/export.h>
 
 #include <linux/linkage.h>
 
@@ -12,6 +13,12 @@
  */
 #define TDVMCALL_EXPOSE_REGS_MASK	0xfc00
 #define TDVMCALL_VENDOR_KVM		0x4d564b2e584454 /* "TDX.KVM" */
+#define EXIT_REASON_IO_INSTRUCTION	30
+/*
+ * The current size of struct tdvmcall_output is 40 bytes, but
+ * allocate double that to accommodate future changes.
+ */
+#define TDVMCALL_OUTPUT_SIZE		80
 
 /*
  * TDX guests use the TDCALL instruction to make
@@ -205,3 +212,150 @@ SYM_FUNC_START(__tdvmcall_vendor_kvm)
 	call do_tdvmcall
 	retq
 SYM_FUNC_END(__tdvmcall_vendor_kvm)
+
+.macro io_save_registers
+	push %rbp
+	push %rbx
+	push %rcx
+	push %rdx
+	push %rdi
+	push %rsi
+	push %r8
+	push %r9
+	push %r10
+	push %r11
+	push %r12
+	push %r13
+	push %r14
+	push %r15
+.endm
+.macro io_restore_registers
+	pop %r15
+	pop %r14
+	pop %r13
+	pop %r12
+	pop %r11
+	pop %r10
+	pop %r9
+	pop %r8
+	pop %rsi
+	pop %rdi
+	pop %rdx
+	pop %rcx
+	pop %rbx
+	pop %rbp
+.endm
+
+/*
+ * tdg_out{b,w,l}()  - Write given data to the specified port.
+ *
+ * @arg1 (RAX)       - Value to be written (passed via R8 to do_tdvmcall()).
+ * @arg2 (RDX)       - Port id (passed via RCX to do_tdvmcall()).
+ *
+ */
+SYM_FUNC_START(tdg_outb)
+	io_save_registers
+	xor %r8, %r8
+	/* Move data to R8 register */
+	mov %al, %r8b
+	/* Set data width to 1 byte */
+	mov $1, %rsi
+	jmp 1f
+
+SYM_FUNC_START(tdg_outw)
+	io_save_registers
+	xor %r8, %r8
+	/* Move data to R8 register */
+	mov %ax, %r8w
+	/* Set data width to 2 bytes */
+	mov $2, %rsi
+	jmp 1f
+
+SYM_FUNC_START(tdg_outl)
+	io_save_registers
+	xor %r8, %r8
+	/* Move data to R8 register */
+	mov %eax, %r8d
+	/* Set data width to 4 bytes */
+	mov $4, %rsi
+1:
+	/*
+	 * Since io_save_registers does not save rax
+	 * state, save it here so that we can preserve
+	 * the caller register state.
+	 */
+	push %rax
+
+	mov %rdx, %rcx
+	/* Set 1 in RDX to select out operation */
+	mov $1, %rdx
+	/* Set TDVMCALL function id in RDI */
+	mov $EXIT_REASON_IO_INSTRUCTION, %rdi
+	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+	xor %r10, %r10
+	/* Since we don't use tdvmcall output, set it to NULL */
+	xor %r9, %r9
+
+	call do_tdvmcall
+
+	pop %rax
+	io_restore_registers
+	ret
+SYM_FUNC_END(tdg_outb)
+SYM_FUNC_END(tdg_outw)
+SYM_FUNC_END(tdg_outl)
+EXPORT_SYMBOL(tdg_outb)
+EXPORT_SYMBOL(tdg_outw)
+EXPORT_SYMBOL(tdg_outl)
+
+/*
+ * tdg_in{b,w,l}()   - Read data from the specified port.
+ *
+ * @arg1 (RDX)       - Port id (passed via RCX to do_tdvmcall()).
+ *
+ * Returns data read via RAX register.
+ *
+ */
+SYM_FUNC_START(tdg_inb)
+	io_save_registers
+	/* Set data width to 1 byte */
+	mov $1, %rsi
+	jmp 1f
+
+SYM_FUNC_START(tdg_inw)
+	io_save_registers
+	/* Set data width to 2 bytes */
+	mov $2, %rsi
+	jmp 1f
+
+SYM_FUNC_START(tdg_inl)
+	io_save_registers
+	/* Set data width to 4 bytes */
+	mov $4, %rsi
+1:
+	mov %rdx, %rcx
+	/* Set 0 in RDX to select in operation */
+	mov $0, %rdx
+	/* Set TDVMCALL function id in RDI */
+	mov $EXIT_REASON_IO_INSTRUCTION, %rdi
+	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+	xor %r10, %r10
+	/* Allocate memory in stack for Output */
+	subq $TDVMCALL_OUTPUT_SIZE, %rsp
+	/* Move tdvmcall_output pointer to R9 */
+	movq %rsp, %r9
+
+	call do_tdvmcall
+
+	/* Move data read from port to RAX */
+	mov TDVMCALL_r11(%r9), %eax
+	/* Free allocated memory */
+	addq $TDVMCALL_OUTPUT_SIZE, %rsp
+	io_restore_registers
+	ret
+SYM_FUNC_END(tdg_inb)
+SYM_FUNC_END(tdg_inw)
+SYM_FUNC_END(tdg_inl)
+EXPORT_SYMBOL(tdg_inb)
+EXPORT_SYMBOL(tdg_inw)
+EXPORT_SYMBOL(tdg_inl)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e42e260df245..ec61f2f06c98 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -189,6 +189,36 @@ static void tdg_handle_cpuid(struct pt_regs *regs)
 	regs->dx = out.r15;
 }
 
+static void tdg_out(int size, int port, unsigned int value)
+{
+	tdvmcall(EXIT_REASON_IO_INSTRUCTION, size, 1, port, value);
+}
+
+static unsigned int tdg_in(int size, int port)
+{
+	return tdvmcall_out_r11(EXIT_REASON_IO_INSTRUCTION, size, 0, port, 0);
+}
+
+static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
+{
+	bool string = exit_qual & 16;
+	int out, size, port;
+
+	/* I/O string ops are unrolled at build time. */
+	BUG_ON(string);
+
+	out = (exit_qual & 8) ? 0 : 1;
+	size = (exit_qual & 7) + 1;
+	port = exit_qual >> 16;
+
+	if (out) {
+		tdg_out(size, port, regs->ax);
+	} else {
+		regs->ax &= ~GENMASK(8 * size - 1, 0);
+		regs->ax |= tdg_in(size, port) & GENMASK(8 * size - 1, 0);
+	}
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -238,6 +268,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_CPUID:
 		tdg_handle_cpuid(regs);
 		break;
+	case EXIT_REASON_IO_INSTRUCTION:
+		tdg_handle_io(regs, ve->exit_qual);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 15/32] x86/tdx: Handle in-kernel MMIO
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (13 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 14/32] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 21:52   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Kuppuswamy Sathyanarayanan
                   ` (17 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Handle #VE due to MMIO operations. MMIO triggers #VE with the
EPT_VIOLATION exit reason.

For now we only handle the subset of instructions that the kernel uses
for MMIO operations. User-space accesses trigger SIGBUS.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 100 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ec61f2f06c98..3fe617978fc4 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -5,6 +5,8 @@
 
 #include <asm/tdx.h>
 #include <asm/vmx.h>
+#include <asm/insn.h>
+#include <linux/sched/signal.h> /* force_sig_fault() */
 
 #include <linux/cpu.h>
 
@@ -219,6 +221,101 @@ static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
 	}
 }
 
+static unsigned long tdg_mmio(int size, bool write, unsigned long addr,
+		unsigned long val)
+{
+	return tdvmcall_out_r11(EXIT_REASON_EPT_VIOLATION, size,
+				write, addr, val);
+}
+
+static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn)
+{
+	static const int regoff[] = {
+		offsetof(struct pt_regs, ax),
+		offsetof(struct pt_regs, cx),
+		offsetof(struct pt_regs, dx),
+		offsetof(struct pt_regs, bx),
+		offsetof(struct pt_regs, sp),
+		offsetof(struct pt_regs, bp),
+		offsetof(struct pt_regs, si),
+		offsetof(struct pt_regs, di),
+		offsetof(struct pt_regs, r8),
+		offsetof(struct pt_regs, r9),
+		offsetof(struct pt_regs, r10),
+		offsetof(struct pt_regs, r11),
+		offsetof(struct pt_regs, r12),
+		offsetof(struct pt_regs, r13),
+		offsetof(struct pt_regs, r14),
+		offsetof(struct pt_regs, r15),
+	};
+	int regno;
+
+	regno = X86_MODRM_REG(insn->modrm.value);
+	if (X86_REX_R(insn->rex_prefix.value))
+		regno += 8;
+
+	return (void *)regs + regoff[regno];
+}
+
+static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+	int size;
+	bool write;
+	unsigned long *reg;
+	struct insn insn;
+	unsigned long val = 0;
+
+	/*
+	 * User mode would mean the kernel exposed a device directly
+	 * to ring3, which shouldn't happen except for things like
+	 * DPDK.
+	 */
+	if (user_mode(regs)) {
+		pr_err("Unexpected user-mode MMIO access.\n");
+		force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
+		return 0;
+	}
+
+	kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
+	insn_get_length(&insn);
+	insn_get_opcode(&insn);
+
+	write = ve->exit_qual & 0x2;
+
+	size = insn.opnd_bytes;
+	switch (insn.opcode.bytes[0]) {
+	/* MOV r/m8	r8	*/
+	case 0x88:
+	/* MOV r8	r/m8	*/
+	case 0x8A:
+	/* MOV r/m8	imm8	*/
+	case 0xC6:
+		size = 1;
+		break;
+	}
+
+	if (inat_has_immediate(insn.attr)) {
+		BUG_ON(!write);
+		val = insn.immediate.value;
+		tdg_mmio(size, write, ve->gpa, val);
+		return insn.length;
+	}
+
+	BUG_ON(!inat_has_modrm(insn.attr));
+
+	reg = get_reg_ptr(regs, &insn);
+
+	if (write) {
+		memcpy(&val, reg, size);
+		tdg_mmio(size, write, ve->gpa, val);
+	} else {
+		val = tdg_mmio(size, write, ve->gpa, val);
+		memset(reg, 0, size);
+		memcpy(reg, &val, size);
+	}
+	return insn.length;
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -271,6 +368,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_IO_INSTRUCTION:
 		tdg_handle_io(regs, ve->exit_qual);
 		break;
+	case EXIT_REASON_EPT_VIOLATION:
+		ve->instr_len = tdg_handle_mmio(regs, ve);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (14 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 15/32] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-11  1:23   ` Dan Williams
  2021-05-11 15:53   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 17/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure Kuppuswamy Sathyanarayanan
                   ` (16 subsequent siblings)
  32 siblings, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

When running as a TDX guest, there are a number of existing,
privileged instructions that do not work. If the guest kernel
uses these instructions, the hardware generates a #VE.

You can find the list of unsupported instructions in Intel
Trust Domain Extensions (Intel® TDX) Module specification,
sec 9.2.2 and in Guest-Host Communication Interface (GHCI)
Specification for Intel TDX, sec 2.4.1.
   
To prevent the TD guest from using the MWAIT/MONITOR instructions,
support for these instructions is already disabled by the TDX
module (SEAM), so the CPUID flags for these instructions should
be in the disabled state.

If, despite the above preventive measures, TD guests still execute
these instructions, print appropriate warning messages in the #VE
handler. For the WBINVD instruction, since it is related to memory
writeback and cache flushes, it is mainly used in the context of I/O
devices. Since TDX 1.0 does not support non-virtual I/O devices,
skipping it should not cause any fatal issues; but to let users know
about its usage, report it with WARN(). For the MWAIT/MONITOR
instructions, since they are unsupported, use WARN() to report their
usage as well.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 3fe617978fc4..294dda5bf3f6 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -371,6 +371,21 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdg_handle_mmio(regs, ve);
 		break;
+	case EXIT_REASON_WBINVD:
+		/*
+		 * WBINVD is not supported inside TDX guests. All in-
+		 * kernel uses should have been disabled.
+		 */
+		WARN_ONCE(1, "TD Guest used unsupported WBINVD instruction\n");
+		break;
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+		/*
+		 * Something in the kernel used MONITOR or MWAIT despite
+		 * X86_FEATURE_MWAIT being cleared for TDX guests.
+		 */
+		WARN_ONCE(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 17/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (15 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 18/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure Kuppuswamy Sathyanarayanan
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Erik Kaneda,
	Bob Moore, Rafael J . Wysocki

From: Erik Kaneda <erik.kaneda@intel.com>

ACPICA commit b9eb6f3a19b816824d6f47a6bc86fd8ce690e04b

Link: https://github.com/acpica/acpica/commit/b9eb6f3a
Signed-off-by: Erik Kaneda <erik.kaneda@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 include/acpi/actbl2.h | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index d6478c430c99..b2362600b9ff 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -516,7 +516,8 @@ enum acpi_madt_type {
 	ACPI_MADT_TYPE_GENERIC_MSI_FRAME = 13,
 	ACPI_MADT_TYPE_GENERIC_REDISTRIBUTOR = 14,
 	ACPI_MADT_TYPE_GENERIC_TRANSLATOR = 15,
-	ACPI_MADT_TYPE_RESERVED = 16	/* 16 and greater are reserved */
+	ACPI_MADT_TYPE_MULTIPROC_WAKEUP = 16,
+	ACPI_MADT_TYPE_RESERVED = 17	/* 17 and greater are reserved */
 };
 
 /*
@@ -723,6 +724,15 @@ struct acpi_madt_generic_translator {
 	u32 reserved2;
 };
 
+/* 16: Multiprocessor wakeup (ACPI 6.4) */
+
+struct acpi_madt_multiproc_wakeup {
+	struct acpi_subtable_header header;
+	u16 mailbox_version;
+	u32 reserved;		/* reserved - must be zero */
+	u64 base_address;
+};
+
 /*
  * Common flags fields for MADT subtables
  */
-- 
2.25.1



* [RFC v2 18/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (16 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 17/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 19/32] ACPI/table: Print MADT Wake table information Kuppuswamy Sathyanarayanan
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

ACPICA commit f1ee04207a212f6c519441e7e25397649ebc4cea

Add the Multiprocessor Wakeup Mailbox Structure definition. It is
needed when parsing the MADT Multiprocessor Wakeup table.

Link: https://github.com/acpica/acpica/commit/f1ee0420
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 include/acpi/actbl2.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index b2362600b9ff..7dce422f6119 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -733,6 +733,20 @@ struct acpi_madt_multiproc_wakeup {
 	u64 base_address;
 };
 
+#define ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE	2032
+#define ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE	2048
+
+struct acpi_madt_multiproc_wakeup_mailbox {
+	u16 command;
+	u16 reserved;		/* reserved - must be zero */
+	u32 apic_id;
+	u64 wakeup_vector;
+	u8 reserved_os[ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE];	/* reserved for OS use */
+	u8 reserved_firmware[ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE];	/* reserved for firmware use */
+};
+
+#define ACPI_MP_WAKE_COMMAND_WAKEUP    1
+
 /*
  * Common flags fields for MADT subtables
  */
-- 
2.25.1



* [RFC v2 19/32] ACPI/table: Print MADT Wake table information
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (17 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 18/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 20/32] x86/acpi, x86/boot: Add multiprocessor wake-up support Kuppuswamy Sathyanarayanan
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan, Rafael J . Wysocki

When the MADT is parsed, print the MADT Wake table information as a
debug message. This is useful when debugging CPU boot issues
related to the MADT wake table.

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/acpi/tables.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 9d581045acff..206df4ad8b2b 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -207,6 +207,17 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
 		}
 		break;
 
+	case ACPI_MADT_TYPE_MULTIPROC_WAKEUP:
+		{
+			struct acpi_madt_multiproc_wakeup *p;
+
+			p = (struct acpi_madt_multiproc_wakeup *) header;
+
+			pr_debug("MP Wake (Mailbox version[%d] base_address[%llx])\n",
+				 p->mailbox_version, p->base_address);
+		}
+		break;
+
 	default:
 		pr_warn("Found unsupported MADT entry (type = 0x%x)\n",
 			header->type);
-- 
2.25.1



* [RFC v2 20/32] x86/acpi, x86/boot: Add multiprocessor wake-up support
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (18 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 19/32] ACPI/table: Print MADT Wake table information Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 21/32] x86/boot: Add a trampoline for APs booting in 64-bit mode Kuppuswamy Sathyanarayanan
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan, Sean Christopherson

As per ACPI specification r6.4, sec 5.2.12.19, a new subtable, the
multiprocessor wake-up structure, is added to the ACPI Multiple APIC
Description Table (MADT) to describe the wake-up mailbox. If the
platform firmware produces the multiprocessor wake-up structure, the
OS may use this new mailbox-based mechanism to wake up the APs.

Add ACPI MADT wake table parsing support for the x86 platform. If the
MADT wake table is present, update apic->wakeup_secondary_cpu with a
new handler that uses the MADT wake mailbox to wake up the APs.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/apic.h |  3 ++
 arch/x86/kernel/acpi/boot.c | 79 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/apic/apic.c |  8 ++++
 3 files changed, 90 insertions(+)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 412b51e059c8..3e94e1f402ea 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -487,6 +487,9 @@ static inline unsigned int read_apic_id(void)
 	return apic->get_apic_id(reg);
 }
 
+typedef int (*wakeup_cpu_handler)(int apicid, unsigned long start_eip);
+extern void acpi_wake_cpu_handler_update(wakeup_cpu_handler handler);
+
 extern int default_apic_id_valid(u32 apicid);
 extern int default_acpi_madt_oem_check(char *, char *);
 extern void default_setup_apic_routing(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 14cd3186dc77..fce2aa7d718f 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -65,6 +65,9 @@ int acpi_fix_pin2_polarity __initdata;
 static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
 #endif
 
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+static u64 acpi_mp_wake_mailbox_paddr;
+
 #ifdef CONFIG_X86_IO_APIC
 /*
  * Locks related to IOAPIC hotplug
@@ -329,6 +332,52 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
 	return 0;
 }
 
+static void acpi_mp_wake_mailbox_init(void)
+{
+	if (acpi_mp_wake_mailbox)
+		return;
+
+	acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+			sizeof(*acpi_mp_wake_mailbox), MEMREMAP_WB);
+}
+
+static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
+{
+	u8 timeout = 0xFF;
+
+	acpi_mp_wake_mailbox_init();
+
+	if (!acpi_mp_wake_mailbox)
+		return -EINVAL;
+
+	/*
+	 * Mailbox memory is shared between firmware and OS. Firmware will
+	 * listen on the mailbox command address, and once it receives the
+	 * wakeup command, the CPU associated with the given apicid will be
+	 * booted. Therefore, the values of apic_id and wakeup_vector must
+	 * be visible before the wakeup command is written. Use WRITE_ONCE
+	 * to prevent the compiler from reordering these stores.
+	 */
+	WRITE_ONCE(acpi_mp_wake_mailbox->apic_id, apicid);
+	WRITE_ONCE(acpi_mp_wake_mailbox->wakeup_vector, start_ip);
+	WRITE_ONCE(acpi_mp_wake_mailbox->command, ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+	/*
+	 * After writing the wakeup command, wait for a maximum of 0xFF
+	 * iterations for firmware to reset the command address back to
+	 * zero, indicating successful reception of the command.
+	 * NOTE: The timeout value of 255 was chosen based on experiments.
+	 *
+	 * XXX: Change the timeout once the ACPI specification defines a
+	 *      standard maximum timeout value.
+	 */
+	while (READ_ONCE(acpi_mp_wake_mailbox->command) && --timeout)
+		cpu_relax();
+
+	/* If the command was never acknowledged, return an error */
+	return READ_ONCE(acpi_mp_wake_mailbox->command) ? -EIO : 0;
+}
+
 #endif				/*CONFIG_X86_LOCAL_APIC */
 
 #ifdef CONFIG_X86_IO_APIC
@@ -1086,6 +1135,30 @@ static int __init acpi_parse_madt_lapic_entries(void)
 	}
 	return 0;
 }
+
+static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+				      const unsigned long end)
+{
+	struct acpi_madt_multiproc_wakeup *mp_wake;
+
+	if (acpi_mp_wake_mailbox)
+		return -EINVAL;
+
+	if (!IS_ENABLED(CONFIG_SMP))
+		return -ENODEV;
+
+	mp_wake = (struct acpi_madt_multiproc_wakeup *) header;
+	if (BAD_MADT_ENTRY(mp_wake, end))
+		return -EINVAL;
+
+	acpi_table_print_madt_entry(&header->common);
+
+	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+
+	acpi_wake_cpu_handler_update(acpi_wakeup_cpu);
+
+	return 0;
+}
 #endif				/* CONFIG_X86_LOCAL_APIC */
 
 #ifdef	CONFIG_X86_IO_APIC
@@ -1284,6 +1357,12 @@ static void __init acpi_process_madt(void)
 
 				smp_found_config = 1;
 			}
+
+			/*
+			 * Parse MADT MP Wake entry.
+			 */
+			acpi_table_parse_madt(ACPI_MADT_TYPE_MULTIPROC_WAKEUP,
+					      acpi_parse_mp_wake, 1);
 		}
 		if (error == -EINVAL) {
 			/*
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 4f26700f314d..f1b90a4b89e8 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2554,6 +2554,14 @@ u32 x86_msi_msg_get_destid(struct msi_msg *msg, bool extid)
 }
 EXPORT_SYMBOL_GPL(x86_msi_msg_get_destid);
 
+void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler)
+{
+	struct apic **drv;
+
+	for (drv = __apicdrivers; drv < __apicdrivers_end; drv++)
+		(*drv)->wakeup_secondary_cpu = handler;
+}
+
 /*
  * Override the generic EOI implementation with an optimized version.
  * Only called during early boot when only one CPU is active and with
-- 
2.25.1



* [RFC v2 21/32] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (19 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 20/32] x86/acpi, x86/boot: Add multiprocessor wake-up support Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-13  2:56   ` Dan Williams
  2021-04-26 18:01 ` [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot for TDX platforms Kuppuswamy Sathyanarayanan
                   ` (11 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kai Huang, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a trampoline for booting APs in 64-bit mode via a software handoff
with BIOS, and use the new trampoline for the ACPI MP wake protocol used
by TDX.

Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
mode.  For the GDT pointer, create a new entry as the existing storage
for the pointer occupies the zero entry in the GDT itself.

Reported-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/realmode.h          |  1 +
 arch/x86/kernel/smpboot.c                |  5 +++
 arch/x86/realmode/rm/header.S            |  1 +
 arch/x86/realmode/rm/trampoline_64.S     | 49 +++++++++++++++++++++++-
 arch/x86/realmode/rm/trampoline_common.S |  5 ++-
 5 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 5db5d083c873..5066c8b35e7c 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
 	u32	sev_es_trampoline_start;
 #endif
 #ifdef CONFIG_X86_64
+	u32	trampoline_start64;
 	u32	trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 16703c35a944..27d8491d753a 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1036,6 +1036,11 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
 	unsigned long boot_error = 0;
 	unsigned long timeout;
 
+#ifdef CONFIG_X86_64
+	if (is_tdx_guest())
+		start_ip = real_mode_header->trampoline_start64;
+#endif
+
 	idle->thread.sp = (unsigned long)task_pt_regs(idle);
 	early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
 	initial_code = (unsigned long)start_secondary;
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
 	.long	pa_sev_es_trampoline_start
 #endif
 #ifdef CONFIG_X86_64
+	.long	pa_trampoline_start64
 	.long	pa_trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 84c5d1b33d10..12b734b1da8b 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,20 @@ SYM_CODE_START(startup_32)
 	movl	%eax, %cr3
 
 	# Set up EFER
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	cmp	pa_tr_efer, %eax
+	jne	.Lwrite_efer
+	cmp	pa_tr_efer + 4, %edx
+	je	.Ldone_efer
+.Lwrite_efer:
 	movl	pa_tr_efer, %eax
 	movl	pa_tr_efer + 4, %edx
-	movl	$MSR_EFER, %ecx
 	wrmsr
 
+.Ldone_efer:
 	# Enable paging and in turn activate Long Mode
-	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	/*
@@ -161,6 +168,19 @@ SYM_CODE_START(startup_32)
 	ljmpl	$__KERNEL_CS, $pa_startup_64
 SYM_CODE_END(startup_32)
 
+SYM_CODE_START(pa_trampoline_compat)
+	/*
+	 * In compatibility mode.  Prep ESP and DX for startup_32, then disable
+	 * paging and complete the switch to legacy 32-bit mode.
+	 */
+	movl	$rm_stack_end, %esp
+	movw	$__KERNEL_DS, %dx
+
+	movl	$(X86_CR0_NE | X86_CR0_PE), %eax
+	movl	%eax, %cr0
+	ljmpl   $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
 	.section ".text64","ax"
 	.code64
 	.balign 4
@@ -169,6 +189,20 @@ SYM_CODE_START(startup_64)
 	jmpq	*tr_start(%rip)
 SYM_CODE_END(startup_64)
 
+SYM_CODE_START(trampoline_start64)
+	/*
+	 * APs start here on a direct transfer from 64-bit BIOS with identity
+	 * mapped page tables.  Load the kernel's GDT in order to gear down to
+	 * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+	 * segment registers.  Load the zero IDT so any fault triggers a
+	 * shutdown instead of jumping back into BIOS.
+	 */
+	lidt	tr_idt(%rip)
+	lgdt	tr_gdt64(%rip)
+
+	ljmpl	*tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
 	.section ".rodata","a"
 	# Duplicate the global descriptor table
 	# so the kernel can live anywhere
@@ -182,6 +216,17 @@ SYM_DATA_START(tr_gdt)
 	.quad	0x00cf93000000ffff	# __KERNEL_DS
 SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
 
+SYM_DATA_START(tr_gdt64)
+	.short	tr_gdt_end - tr_gdt - 1	# gdt limit
+	.long	pa_tr_gdt
+	.long	0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+	.long	pa_trampoline_compat
+	.short	__KERNEL32_CS
+SYM_DATA_END(tr_compat)
+
 	.bss
 	.balign	PAGE_SIZE
 SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 5033e640f957..506d5897112a 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -1,4 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 	.section ".rodata","a"
 	.balign	16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+SYM_DATA_START_LOCAL(tr_idt)
+	.short	0
+	.quad	0
+SYM_DATA_END(tr_idt)
-- 
2.25.1



* [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot for TDX platforms
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (20 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 21/32] x86/boot: Add a trampoline for APs booting in 64-bit mode Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-13  3:03   ` Dan Williams
  2021-04-26 18:01 ` [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process Kuppuswamy Sathyanarayanan
                   ` (10 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Avoid operations that would inject a #VE during compressed boot,
which would be fatal for TDX platforms:

 1. The TDX module injects a #VE if a TDX guest attempts to write
    EFER. So skip the WRMSR to set EFER.LME=1 if it's already set.
    TDX also forces EFER.LME=1, i.e. the branch will always be
    taken and the #VE thus avoided.

 2. The TDX module also injects a #VE if the guest attempts to clear
    CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed
    boot. Setting CR0.NE should be a nop on all CPUs that support
    64-bit mode.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S | 5 +++--
 arch/x86/boot/compressed/pgtable.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..37c2f37d4a0d 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,8 +616,9 @@ SYM_CODE_START(trampoline_32bit_src)
 	movl	$MSR_EFER, %ecx
 	rdmsr
 	btsl	$_EFER_LME, %eax
+	jc	1f
 	wrmsr
-	popl	%edx
+1:	popl	%edx
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
@@ -636,7 +637,7 @@ SYM_CODE_START(trampoline_32bit_src)
 	pushl	%eax
 
 	/* Enable paging again */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
+	movl	$(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	lret
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE	0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
 
 #define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
 
-- 
2.25.1



* [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (21 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot for TDX platforms Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-13  3:23   ` Dan Williams
  2021-04-26 18:01 ` [RFC v2 24/32] x86/topology: Disable CPU online/offline control for TDX guest Kuppuswamy Sathyanarayanan
                   ` (9 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Skip writing EFER during secondary_startup_64() if the current value is
also the desired value. This avoids a #VE when running as a TDX guest,
as the TDX-Module does not allow writes to EFER (even when writing the
current, fixed value).

Also, preserve CR4.MCE instead of clearing it during boot to avoid a #VE
when running as a TDX guest. The TDX-Module (effectively part of the
hypervisor) requires CR4.MCE to be set at all times and injects a #VE
if the guest attempts to clear CR4.MCE.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S |  5 ++++-
 arch/x86/kernel/head_64.S          | 13 +++++++++++--
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 37c2f37d4a0d..2d79e5f97360 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -622,7 +622,10 @@ SYM_CODE_START(trampoline_32bit_src)
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
-	movl	$X86_CR4_PAE, %eax
+	movl	%cr4, %eax
+	/* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
+	andl	$X86_CR4_MCE, %eax
+	orl	$X86_CR4_PAE, %eax
 	testl	%edx, %edx
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..92c77cf75542 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 1:
 
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	movq	%cr4, %rcx
+	/* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
+	andl	$X86_CR4_MCE, %ecx
+	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
@@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	/* Setup EFER (Extended Feature Enable Register) */
 	movl	$MSR_EFER, %ecx
 	rdmsr
+	movl    %eax, %edx
 	btsl	$_EFER_SCE, %eax	/* Enable System Call */
 	btl	$20,%edi		/* No Execute supported? */
 	jnc     1f
 	btsl	$_EFER_NX, %eax
 	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1:	wrmsr				/* Make changes effective */
 
+	/* Skip the WRMSR if the current value matches the desired value. */
+1:	cmpl	%edx, %eax
+	je	1f
+	xor	%edx, %edx
+	wrmsr				/* Make changes effective */
+1:
 	/* Setup cr0 */
 	movl	$CR0_STATE, %eax
 	/* Make changes effective */
-- 
2.25.1



* [RFC v2 24/32] x86/topology: Disable CPU online/offline control for TDX guest
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (22 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 25/32] x86/tdx: Forcefully disable legacy PIC for TDX guests Kuppuswamy Sathyanarayanan
                   ` (8 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

As per the Intel TDX Virtual Firmware Design Guide, sec 4.3.5 and
sec 9.4, all unused CPUs are kept spinning by TDVF until the OS
requests CPU bring-up via the mailbox address passed in the ACPI
MADT table. Since all unused CPUs are always in this spinning state
by default, there is no point in supporting a dynamic CPU
online/offline feature. So the current generation of TDVF does not
support CPU hotplug. It may be supported in the next generation.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/tdx.c      | 14 ++++++++++++++
 arch/x86/kernel/topology.c |  3 ++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 294dda5bf3f6..ab1efa4d10e9 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -316,6 +316,17 @@ static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 	return insn.length;
 }
 
+static int tdg_cpu_offline_prepare(unsigned int cpu)
+{
+	/*
+	 * Per the Intel TDX Virtual Firmware Design Guide,
+	 * sec 4.3.5 and sec 9.4, hotplug is not supported
+	 * on TDX platforms. So don't support the CPU
+	 * offline feature once it's turned on.
+	 */
+	return -EOPNOTSUPP;
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -410,5 +421,8 @@ void __init tdx_early_init(void)
 	pv_ops.irq.safe_halt = tdg_safe_halt;
 	pv_ops.irq.halt = tdg_halt;
 
+	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdg:cpu_hotplug",
+			  NULL, tdg_cpu_offline_prepare);
+
 	pr_info("TDX guest is initialized\n");
 }
diff --git a/arch/x86/kernel/topology.c b/arch/x86/kernel/topology.c
index f5477eab5692..d879ea96d79c 100644
--- a/arch/x86/kernel/topology.c
+++ b/arch/x86/kernel/topology.c
@@ -34,6 +34,7 @@
 #include <linux/irq.h>
 #include <asm/io_apic.h>
 #include <asm/cpu.h>
+#include <asm/tdx.h>
 
 static DEFINE_PER_CPU(struct x86_cpu, cpu_devices);
 
@@ -130,7 +131,7 @@ int arch_register_cpu(int num)
 			}
 		}
 	}
-	if (num || cpu0_hotpluggable)
+	if ((num || cpu0_hotpluggable) && !is_tdx_guest())
 		per_cpu(cpu_devices, num).cpu.hotpluggable = 1;
 
 	return register_cpu(&per_cpu(cpu_devices, num).cpu, num);
-- 
2.25.1



* [RFC v2 25/32] x86/tdx: Forcefully disable legacy PIC for TDX guests
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (23 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 24/32] x86/topology: Disable CPU online/offline control for TDX guest Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code Kuppuswamy Sathyanarayanan
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Disable the legacy PIC (8259) for TDX guests, as the PIC cannot be
supported by the VMM. The TDX module does not allow direct IRQ
injection, and using posted-interrupt style delivery requires the
guest to EOI the IRQ, which diverges from the legacy PIC behavior.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ab1efa4d10e9..1f1bb98e1d38 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -4,6 +4,7 @@
 #define pr_fmt(fmt) "TDX: " fmt
 
 #include <asm/tdx.h>
+#include <asm/i8259.h>
 #include <asm/vmx.h>
 #include <asm/insn.h>
 #include <linux/sched/signal.h> /* force_sig_fault() */
@@ -421,6 +422,8 @@ void __init tdx_early_init(void)
 	pv_ops.irq.safe_halt = tdg_safe_halt;
 	pv_ops.irq.halt = tdg_halt;
 
+	legacy_pic = &null_legacy_pic;
+
 	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdg:cpu_hotplug",
 			  NULL, tdg_cpu_offline_prepare);
 
-- 
2.25.1



* [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (24 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 25/32] x86/tdx: Forcefully disable legacy PIC for TDX guests Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 21:54   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Kuppuswamy Sathyanarayanan
                   ` (6 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Intel TDX doesn't allow VMM to access guest memory. Any memory
that is required for communication with VMM must be shared
explicitly by setting the bit in page table entry. And, after
setting the shared bit, the conversion must be completed with
MapGPA TDVMALL. The call informs VMM about the conversion and
makes it remove the GPA from the S-EPT mapping. The shared
memory is similar to unencrypted memory in AMD SME/SEV terminology
but the underlying process of sharing/un-sharing the memory is
different for Intel TDX guest platform.

SEV assumes that I/O devices can only do DMA to "decrypted"
physical addresses without the C-bit set.  In order for the CPU
to interact with this memory, the CPU needs a decrypted mapping.
To add this support, AMD SME code forces force_dma_unencrypted()
to return true for platforms that support AMD SEV feature. It will
be used for DMA memory allocation API to trigger
set_memory_decrypted() for platforms that support AMD SEV feature.

TDX is similar.  TDX architecturally prevents access to private
guest memory by anything other than the guest itself. This means that
any DMA buffers must be shared.

So move force_dma_unencrypted() out of the AMD-specific code.

It will later be modified to also return true on Intel TDX guest
platforms, as it does for AMD SEV.

Introduce a new config option X86_MEM_ENCRYPT_COMMON that has to be
selected by all x86 memory encryption features. Both the AMD SEV and
Intel TDX guest config options will select it.

This is preparation for TDX changes in the DMA code and has no
functional change.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/Kconfig                 |  8 +++++--
 arch/x86/mm/Makefile             |  2 ++
 arch/x86/mm/mem_encrypt.c        | 30 -------------------------
 arch/x86/mm/mem_encrypt_common.c | 38 ++++++++++++++++++++++++++++++++
 4 files changed, 46 insertions(+), 32 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 932e6d759ba7..67f99bf27729 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1529,14 +1529,18 @@ config X86_CPA_STATISTICS
 	  helps to determine the effectiveness of preserving large and huge
 	  page mappings when mapping protections are changed.
 
+config X86_MEM_ENCRYPT_COMMON
+	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select DYNAMIC_PHYSICAL_MASK
+	def_bool n
+
 config AMD_MEM_ENCRYPT
 	bool "AMD Secure Memory Encryption (SME) support"
 	depends on X86_64 && CPU_SUP_AMD
 	select DMA_COHERENT_POOL
-	select DYNAMIC_PHYSICAL_MASK
 	select ARCH_USE_MEMREMAP_PROT
-	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
 	select INSTRUCTION_DECODER
+	select X86_MEM_ENCRYPT_COMMON
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 
+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON)	+= mem_encrypt_common.o
+
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index ae78cef79980..6f713c6a32b2 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -15,10 +15,6 @@
 #include <linux/dma-direct.h>
 #include <linux/swiotlb.h>
 #include <linux/mem_encrypt.h>
-#include <linux/device.h>
-#include <linux/kernel.h>
-#include <linux/bitops.h>
-#include <linux/dma-mapping.h>
 
 #include <asm/tlbflush.h>
 #include <asm/fixmap.h>
@@ -390,32 +386,6 @@ bool noinstr sev_es_active(void)
 	return sev_status & MSR_AMD64_SEV_ES_ENABLED;
 }
 
-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
-{
-	/*
-	 * For SEV, all DMA must be to unencrypted addresses.
-	 */
-	if (sev_active())
-		return true;
-
-	/*
-	 * For SME, all DMA must be to unencrypted addresses if the
-	 * device does not support DMA to addresses that include the
-	 * encryption mask.
-	 */
-	if (sme_active()) {
-		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
-		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
-						dev->bus_dma_limit);
-
-		if (dma_dev_mask <= dma_enc_mask)
-			return true;
-	}
-
-	return false;
-}
-
 void __init mem_encrypt_free_decrypted_mem(void)
 {
 	unsigned long vaddr, vaddr_end, npages;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..964e04152417
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2016 Advanced Micro Devices, Inc.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/mem_encrypt.h>
+#include <linux/dma-mapping.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+	/*
+	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 */
+	if (sev_active())
+		return true;
+
+	/*
+	 * For SME, all DMA must be to unencrypted addresses if the
+	 * device does not support DMA to addresses that include the
+	 * encryption mask.
+	 */
+	if (sme_active()) {
+		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
+		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
+						dev->bus_dma_limit);
+
+		if (dma_dev_mask <= dma_enc_mask)
+			return true;
+	}
+
+	return false;
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (25 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-19  5:00   ` Kuppuswamy, Sathyanarayanan
  2021-05-19 16:14   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 28/32] x86/tdx: Make pages shared in ioremap() Kuppuswamy Sathyanarayanan
                   ` (5 subsequent siblings)
  32 siblings, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

tdg_shared_mask() returns the mask that has to be set in a page
table entry to make a page shared with the VMM.

Also note that the shared-mapping configuration cannot be combined
between AMD SME and Intel TDX guest platforms in a common function.
SME has to do it very early, in __startup_64(), as it sets the bit on
all memory except what is used for communication. TDX can postpone
it, since no shared mappings are needed in very early boot.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/Kconfig           | 1 +
 arch/x86/include/asm/tdx.h | 6 ++++++
 arch/x86/kernel/tdx.c      | 9 +++++++++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67f99bf27729..5f92e8205de2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -882,6 +882,7 @@ config INTEL_TDX_GUEST
 	select PARAVIRT_XL
 	select X86_X2APIC
 	select SECURITY_LOCKDOWN_LSM
+	select X86_MEM_ENCRYPT_COMMON
 	help
 	  Provide support for running in a trusted domain on Intel processors
 	  equipped with Trusted Domain eXtenstions. TDX is an new Intel
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b972c6531a53..dc80cf7f7d08 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -111,6 +111,8 @@ unsigned char tdg_inb(unsigned short port);
 unsigned short tdg_inw(unsigned short port);
 unsigned int tdg_inl(unsigned short port);
 
+extern phys_addr_t tdg_shared_mask(void);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
@@ -149,6 +151,10 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
 	return -ENODEV;
 }
 
+static inline phys_addr_t tdg_shared_mask(void)
+{
+	return 0;
+}
 #endif /* CONFIG_INTEL_TDX_GUEST */
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 1f1bb98e1d38..7e391cd7aa2b 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -76,6 +76,12 @@ bool is_tdx_guest(void)
 }
 EXPORT_SYMBOL_GPL(is_tdx_guest);
 
+/* The highest bit of a guest physical address is the "sharing" bit */
+phys_addr_t tdg_shared_mask(void)
+{
+	return 1ULL << (td_info.gpa_width - 1);
+}
+
 static void tdg_get_info(void)
 {
 	u64 ret;
@@ -87,6 +93,9 @@ static void tdg_get_info(void)
 
 	td_info.gpa_width = out.rcx & GENMASK(5, 0);
 	td_info.attributes = out.rdx;
+
+	/* Exclude Shared bit from the __PHYSICAL_MASK */
+	physical_mask &= ~tdg_shared_mask();
 }
 
 static __cpuidle void tdg_halt(void)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (26 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 21:55   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 29/32] x86/tdx: Add helper to do MapGPA TDVMCALL Kuppuswamy Sathyanarayanan
                   ` (4 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

All ioremap()ed pages that are not backed by normal memory (NONE or
RESERVED) have to be mapped as shared.

Reuse the infrastructure we have for AMD SEV.

Note that the DMA code doesn't use ioremap() to convert memory to
shared, since DMA buffers are backed by normal memory. The DMA code
makes buffers shared with set_memory_decrypted().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h | 3 +++
 arch/x86/mm/ioremap.c          | 8 +++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..734e775605c0 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -21,6 +21,9 @@
 #define pgprot_encrypted(prot)	__pgprot(__sme_set(pgprot_val(prot)))
 #define pgprot_decrypted(prot)	__pgprot(__sme_clr(pgprot_val(prot)))
 
+/* Make the page accesable by VMM */
+#define pgprot_tdg_shared(prot) __pgprot(pgprot_val(prot) | tdg_shared_mask())
+
 #ifndef __ASSEMBLY__
 #include <asm/x86_init.h>
 #include <asm/fpu/xstate.h>
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 9e5ccc56f8e0..c0dac02f5b3f 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -87,12 +87,12 @@ static unsigned int __ioremap_check_ram(struct resource *res)
 }
 
 /*
- * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
- * there the whole memory is already encrypted.
+ * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
+ * private in TDX case) because there the whole memory is already encrypted.
  */
 static unsigned int __ioremap_check_encrypted(struct resource *res)
 {
-	if (!sev_active())
+	if (!sev_active() && !is_tdx_guest())
 		return 0;
 
 	switch (res->desc) {
@@ -244,6 +244,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
 	prot = PAGE_KERNEL_IO;
 	if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
 		prot = pgprot_encrypted(prot);
+	else if (is_tdx_guest())
+		prot = pgprot_tdg_shared(prot);
 
 	switch (pcm) {
 	case _PAGE_CACHE_MODE_UC:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 29/32] x86/tdx: Add helper to do MapGPA TDVMCALL
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (27 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 28/32] x86/tdx: Make pages shared in ioremap() Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-19 15:59   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 30/32] x86/tdx: Make DMA pages shared Kuppuswamy Sathyanarayanan
                   ` (3 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The MapGPA TDVMCALL requests that the host VMM map a GPA range as
private or shared memory. Shared GPA mappings can be used for
communication between the TD guest and the host VMM, for example for
paravirtualized I/O.

The new helper tdg_map_gpa() provides access to the operation.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/tdx.h | 13 +++++++++++++
 arch/x86/kernel/tdx.c      | 13 +++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index dc80cf7f7d08..4789798d7737 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -7,6 +7,11 @@
 
 #ifndef __ASSEMBLY__
 
+enum tdx_map_type {
+	TDX_MAP_PRIVATE,
+	TDX_MAP_SHARED,
+};
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
@@ -112,6 +117,8 @@ unsigned short tdg_inw(unsigned short port);
 unsigned int tdg_inl(unsigned short port);
 
 extern phys_addr_t tdg_shared_mask(void);
+extern int tdg_map_gpa(phys_addr_t gpa, int numpages,
+		       enum tdx_map_type map_type);
 
 #else // !CONFIG_INTEL_TDX_GUEST
 
@@ -155,6 +162,12 @@ static inline phys_addr_t tdg_shared_mask(void)
 {
 	return 0;
 }
+
+static inline int tdg_map_gpa(phys_addr_t gpa, int numpages,
+			      enum tdx_map_type map_type)
+{
+	return -ENODEV;
+}
 #endif /* CONFIG_INTEL_TDX_GUEST */
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 7e391cd7aa2b..074136473011 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -15,6 +15,8 @@
 #include "tdx-kvm.c"
 #endif
 
+#define TDVMCALL_MAP_GPA	0x10001
+
 static struct {
 	unsigned int gpa_width;
 	unsigned long attributes;
@@ -98,6 +100,17 @@ static void tdg_get_info(void)
 	physical_mask &= ~tdg_shared_mask();
 }
 
+int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+{
+	u64 ret;
+
+	if (map_type == TDX_MAP_SHARED)
+		gpa |= tdg_shared_mask();
+
+	ret = tdvmcall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0);
+	return ret ? -EIO : 0;
+}
+
 static __cpuidle void tdg_halt(void)
 {
 	u64 ret;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 30/32] x86/tdx: Make DMA pages shared
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (28 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 29/32] x86/tdx: Add helper to do MapGPA TDVMCALL Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-18  1:19   ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 31/32] x86/kvm: Use bounce buffers for TD guest Kuppuswamy Sathyanarayanan
                   ` (2 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Kai Huang,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Make force_dma_unencrypted() return true for TDX to get DMA pages mapped
as shared.

__set_memory_enc_dec() is now aware of TDX and sets the Shared bit
accordingly, following up with the relevant TDVMCALL.

Also, do TDACCEPTPAGE on every 4k page after mapping the GPA range
when converting memory to private. If the VMM uses a common pool for
private and shared memory, it will likely do TDAUGPAGE in response to
MAP_GPA (or on the first access to the private GPA); in that case the
TDX module will hold the page in a non-present "pending" state until
it is explicitly accepted.

BUG() if TDACCEPTPAGE fails (except in the above case), as the guest
is completely hosed if it can't access memory.

Tested-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/tdx.h       |  3 ++
 arch/x86/kernel/tdx.c            | 26 ++++++++++++++++-
 arch/x86/mm/mem_encrypt_common.c |  4 +--
 arch/x86/mm/pat/set_memory.c     | 48 ++++++++++++++++++++++++++------
 4 files changed, 70 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4789798d7737..2794bf71e45c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -19,6 +19,9 @@ enum tdx_map_type {
 
 #define TDINFO			1
 #define TDGETVEINFO		3
+#define TDACCEPTPAGE		6
+
+#define TDX_PAGE_ALREADY_ACCEPTED	0x8000000000000001
 
 struct tdcall_output {
 	u64 rcx;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 074136473011..44dd12c693d0 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -100,7 +100,8 @@ static void tdg_get_info(void)
 	physical_mask &= ~tdg_shared_mask();
 }
 
-int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+static int __tdg_map_gpa(phys_addr_t gpa, int numpages,
+			 enum tdx_map_type map_type)
 {
 	u64 ret;
 
@@ -111,6 +112,29 @@ int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
 	return ret ? -EIO : 0;
 }
 
+static void tdg_accept_page(phys_addr_t gpa)
+{
+	u64 ret;
+
+	ret = __tdcall(TDACCEPTPAGE, gpa, 0, 0, 0, NULL);
+
+	BUG_ON(ret && ret != TDX_PAGE_ALREADY_ACCEPTED);
+}
+
+int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+{
+	int ret, i;
+
+	ret = __tdg_map_gpa(gpa, numpages, map_type);
+	if (ret || map_type == TDX_MAP_SHARED)
+		return ret;
+
+	for (i = 0; i < numpages; i++)
+		tdg_accept_page(gpa + i*PAGE_SIZE);
+
+	return 0;
+}
+
 static __cpuidle void tdg_halt(void)
 {
 	u64 ret;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 964e04152417..b6d93b0c5dcf 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -15,9 +15,9 @@
 bool force_dma_unencrypted(struct device *dev)
 {
 	/*
-	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 * For SEV and TDX, all DMA must be to unencrypted/shared addresses.
 	 */
-	if (sev_active())
+	if (sev_active() || is_tdx_guest())
 		return true;
 
 	/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..ea78c7907847 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -27,6 +27,7 @@
 #include <asm/proto.h>
 #include <asm/memtype.h>
 #include <asm/set_memory.h>
+#include <asm/tdx.h>
 
 #include "../mm_internal.h"
 
@@ -1972,13 +1973,15 @@ int set_memory_global(unsigned long addr, int numpages)
 				    __pgprot(_PAGE_GLOBAL), 0);
 }
 
-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
 {
+	pgprot_t mem_protected_bits, mem_plain_bits;
 	struct cpa_data cpa;
+	enum tdx_map_type map_type;
 	int ret;
 
-	/* Nothing to do if memory encryption is not active */
-	if (!mem_encrypt_active())
+	/* Nothing to do if memory encryption and TDX are not active */
+	if (!mem_encrypt_active() && !is_tdx_guest())
 		return 0;
 
 	/* Should not be working on unaligned addresses */
@@ -1988,8 +1991,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	memset(&cpa, 0, sizeof(cpa));
 	cpa.vaddr = &addr;
 	cpa.numpages = numpages;
-	cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
-	cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+	if (is_tdx_guest()) {
+		mem_protected_bits = __pgprot(0);
+		mem_plain_bits = __pgprot(tdg_shared_mask());
+	} else {
+		mem_protected_bits = __pgprot(_PAGE_ENC);
+		mem_plain_bits = __pgprot(0);
+	}
+
+	if (protect) {
+		cpa.mask_set = mem_protected_bits;
+		cpa.mask_clr = mem_plain_bits;
+		map_type = TDX_MAP_PRIVATE;
+	} else {
+		cpa.mask_set = mem_plain_bits;
+		cpa.mask_clr = mem_protected_bits;
+		map_type = TDX_MAP_SHARED;
+	}
+
 	cpa.pgd = init_mm.pgd;
 
 	/* Must avoid aliasing mappings in the highmem code */
@@ -1998,8 +2018,16 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 
 	/*
 	 * Before changing the encryption attribute, we need to flush caches.
+	 *
+	 * For TDX we need to flush caches on private->shared. VMM is
+	 * responsible for flushing on shared->private.
 	 */
-	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	if (is_tdx_guest()) {
+		if (map_type == TDX_MAP_SHARED)
+			cpa_flush(&cpa, 1);
+	} else {
+		cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	}
 
 	ret = __change_page_attr_set_clr(&cpa, 1);
 
@@ -2012,18 +2040,22 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	 */
 	cpa_flush(&cpa, 0);
 
+	if (!ret && is_tdx_guest()) {
+		ret = tdg_map_gpa(__pa(addr), numpages, map_type);
+	}
+
 	return ret;
 }
 
 int set_memory_encrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, true);
+	return __set_memory_protect(addr, numpages, true);
 }
 EXPORT_SYMBOL_GPL(set_memory_encrypted);
 
 int set_memory_decrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, false);
+	return __set_memory_protect(addr, numpages, false);
 }
 EXPORT_SYMBOL_GPL(set_memory_decrypted);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 31/32] x86/kvm: Use bounce buffers for TD guest
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (29 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 30/32] x86/tdx: Make DMA pages shared Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-06-01  2:03   ` [RFC v2-fix-v1 1/1] " Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 32/32] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kuppuswamy Sathyanarayanan
  2021-05-03 23:21 ` [RFC v2 00/32] Add TDX Guest Support Kuppuswamy, Sathyanarayanan
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

TDX doesn't allow DMA access to guest private memory. In order for
DMA to work properly in a TD guest, use SWIOTLB bounce buffers.

Move the AMD SEV initialization into common code and adapt it for
TDX.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/io.h        |  3 +-
 arch/x86/kernel/pci-swiotlb.c    |  2 +-
 arch/x86/kernel/tdx.c            |  3 ++
 arch/x86/mm/mem_encrypt.c        | 45 ------------------------------
 arch/x86/mm/mem_encrypt_common.c | 47 ++++++++++++++++++++++++++++++++
 5 files changed, 53 insertions(+), 47 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 30a3b30395ad..658d9c2c2a9a 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -257,10 +257,11 @@ static inline void slow_down_io(void)
 
 #endif
 
+extern struct static_key_false sev_enable_key;
+
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 #include <linux/jump_label.h>
 
-extern struct static_key_false sev_enable_key;
 static inline bool sev_key_active(void)
 {
 	return static_branch_unlikely(&sev_enable_key);
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..020e13749758 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -49,7 +49,7 @@ int __init pci_swiotlb_detect_4gb(void)
 	 * buffers are allocated and used for devices that do not support
 	 * the addressing range required for the encryption mask.
 	 */
-	if (sme_active())
+	if (sme_active() || is_tdx_guest())
 		swiotlb = 1;
 
 	return swiotlb;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 44dd12c693d0..6b07e7b4a69c 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -8,6 +8,7 @@
 #include <asm/vmx.h>
 #include <asm/insn.h>
 #include <linux/sched/signal.h> /* force_sig_fault() */
+#include <linux/swiotlb.h>
 
 #include <linux/cpu.h>
 
@@ -470,6 +471,8 @@ void __init tdx_early_init(void)
 
 	legacy_pic = &null_legacy_pic;
 
+	swiotlb_force = SWIOTLB_FORCE;
+
 	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdg:cpu_hotplug",
 			  NULL, tdg_cpu_offline_prepare);
 
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 6f713c6a32b2..761a98904aa2 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -409,48 +409,3 @@ void __init mem_encrypt_free_decrypted_mem(void)
 
 	free_init_pages("unused decrypted", vaddr, vaddr_end);
 }
-
-static void print_mem_encrypt_feature_info(void)
-{
-	pr_info("AMD Memory Encryption Features active:");
-
-	/* Secure Memory Encryption */
-	if (sme_active()) {
-		/*
-		 * SME is mutually exclusive with any of the SEV
-		 * features below.
-		 */
-		pr_cont(" SME\n");
-		return;
-	}
-
-	/* Secure Encrypted Virtualization */
-	if (sev_active())
-		pr_cont(" SEV");
-
-	/* Encrypted Register State */
-	if (sev_es_active())
-		pr_cont(" SEV-ES");
-
-	pr_cont("\n");
-}
-
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
-{
-	if (!sme_me_mask)
-		return;
-
-	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-	swiotlb_update_mem_attributes();
-
-	/*
-	 * With SEV, we need to unroll the rep string I/O instructions,
-	 * but SEV-ES supports them through the #VC handler.
-	 */
-	if (sev_active() && !sev_es_active())
-		static_branch_enable(&sev_enable_key);
-
-	print_mem_encrypt_feature_info();
-}
-
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index b6d93b0c5dcf..625c15fa92f9 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,6 +10,7 @@
 #include <linux/mm.h>
 #include <linux/mem_encrypt.h>
 #include <linux/dma-mapping.h>
+#include <linux/swiotlb.h>
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
@@ -36,3 +37,49 @@ bool force_dma_unencrypted(struct device *dev)
 
 	return false;
 }
+
+static void print_amd_mem_encrypt_feature_info(void)
+{
+	pr_info("AMD Memory Encryption Features active:");
+
+	/* Secure Memory Encryption */
+	if (sme_active()) {
+		/*
+		 * SME is mutually exclusive with any of the SEV
+		 * features below.
+		 */
+		pr_cont(" SME\n");
+		return;
+	}
+
+	/* Secure Encrypted Virtualization */
+	if (sev_active())
+		pr_cont(" SEV");
+
+	/* Encrypted Register State */
+	if (sev_es_active())
+		pr_cont(" SEV-ES");
+
+	pr_cont("\n");
+}
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+	if (!sme_me_mask && !is_tdx_guest())
+		return;
+
+	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
+	swiotlb_update_mem_attributes();
+
+	/*
+	 * With SEV, we need to unroll the rep string I/O instructions,
+	 * but SEV-ES supports them through the #VC handler.
+	 */
+	if (sev_active() && !sev_es_active())
+		static_branch_enable(&sev_enable_key);
+
+	/* sme_me_mask !=0 means SME or SEV */
+	if (sme_me_mask)
+		print_amd_mem_encrypt_feature_info();
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2 32/32] x86/tdx: ioapic: Add shared bit for IOAPIC base address
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (30 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 31/32] x86/kvm: Use bounce buffers for TD guest Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 23:06   ` Dave Hansen
  2021-05-03 23:21 ` [RFC v2 00/32] Add TDX Guest Support Kuppuswamy, Sathyanarayanan
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata,
	Kuppuswamy Sathyanarayanan

From: Isaku Yamahata <isaku.yamahata@intel.com>

The IOAPIC is emulated by KVM, which means its MMIO address is shared
with the host. Add the Shared bit for the IOAPIC base address. Most
MMIO regions are handled by ioremap(), which already marks them as
shared on TDX guest platforms, but the IOAPIC is an exception: it
uses a fixmap.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/apic/io_apic.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 73ff4dd426a8..2a01d4a82be7 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2675,6 +2675,14 @@ static struct resource * __init ioapic_setup_resources(void)
 	return res;
 }
 
+static void io_apic_set_fixmap_nocache(enum fixed_addresses idx, phys_addr_t phys)
+{
+	pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+	if (is_tdx_guest())
+		flags = pgprot_tdg_shared(flags);
+	__set_fixmap(idx, phys, flags);
+}
+
 void __init io_apic_init_mappings(void)
 {
 	unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
@@ -2707,7 +2715,7 @@ void __init io_apic_init_mappings(void)
 				      __func__, PAGE_SIZE, PAGE_SIZE);
 			ioapic_phys = __pa(ioapic_phys);
 		}
-		set_fixmap_nocache(idx, ioapic_phys);
+		io_apic_set_fixmap_nocache(idx, ioapic_phys);
 		apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
 			__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
 			ioapic_phys);
@@ -2836,7 +2844,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
 	ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
 	ioapics[idx].mp_config.apicaddr = address;
 
-	set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+	io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
 	if (bad_ioapic_register(idx)) {
 		clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
 		return -ENODEV;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-26 18:01 ` [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions Kuppuswamy Sathyanarayanan
@ 2021-04-26 20:32   ` Dave Hansen
  2021-04-26 22:31     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-04-26 20:32 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

> +/*
> + * Expose registers R10-R15 to VMM (for bitfield info
> + * refer to TDX GHCI specification).
> + */
> +#define TDVMCALL_EXPOSE_REGS_MASK	0xfc00

Why can't we do:

#define TDC_R10	BIT(18)
#define TDC_R11	BIT(19)

and:

#define TDVMCALL_EXPOSE_REGS_MASK	(TDX_R10 | TDX_R11 | TDX_R12 ...

or at least:

#define TDVMCALL_EXPOSE_REGS_MASK	BIT(18) | BIT(19) ...

?

> +/*
> + * TDX guests use the TDCALL instruction to make
> + * hypercalls to the VMM. It is supported in
> + * Binutils >= 2.36.
> + */
> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
> +
> +/*
> + * __tdcall()  - Used to communicate with the TDX module

Why is this function here?  What does it do?  Why do we need it?

I'd like this to actually talk about doing impedance matching between
the function call and TDCALL ABIs.

> + * @arg1 (RDI) - TDCALL Leaf ID
> + * @arg2 (RSI) - Input parameter 1 passed to TDX module
> + *               via register RCX
> + * @arg2 (RDX) - Input parameter 2 passed to TDX module
> + *               via register RDX
> + * @arg3 (RCX) - Input parameter 3 passed to TDX module
> + *               via register R8
> + * @arg4 (R8)  - Input parameter 4 passed to TDX module
> + *               via register R9

The unnecessary repetition and verbosity actually make this harder to
read.  This looks like it was easy to write, but not much effort is
being made to make it easy to consume.  Could you please apply some
consideration to making it more readable?


> + * @arg5 (R9)  - struct tdcall_output pointer
> + *
> + * @out        - Return status of tdcall via RAX.

Don't comments usually just say "returns ... foo"?  Also, the @params
usually refer to *REAL* variable names.  Where the heck does "out" come
from?  Why are you even putting argX?  Shouldn't these be @'s be their
literal function argument names?

	@rdi - Input parameter, moved to RCX

> + * NOTE: This function should only used for non TDVMCALL
> + *       use cases
> + */
> +SYM_FUNC_START(__tdcall)
> +	FRAME_BEGIN
> +
> +	/* Save non-volatile GPRs that are exposed to the VMM. */
> +	push %r15
> +	push %r14
> +	push %r13
> +	push %r12

Why do we have to save these?  Because they might be clobbered?  If so,
let's say *THAT* instead of just "exposed".  "Exposed" could mean "VMM
can read".

Also, this just told me that this function can't be used to talk to the
VMM.  Why is this talking about exposure to the VMM?

> +	/* Move TDCALL Leaf ID to RAX */
> +	mov %rdi, %rax
> +	/* Move output pointer to R12 */
> +	mov %r9, %r12

I thought 'struct tdcall_output' was a purely software construct.  Why
are we passing a pointer to it into TDCALL?

> +	/* Move input param 4 to R9 */
> +	mov %r8, %r9
> +	/* Move input param 3 to R8 */
> +	mov %rcx, %r8
> +	/* Leave input param 2 in RDX */
> +	/* Move input param 1 to RCX */
> +	mov %rsi, %rcx

With a little work, this can be made a *LOT* more readable:

	/* Mangle function call ABI into TDCALL ABI: */
	mov %rdi, %rax	/* Move TDCALL Leaf ID to RAX */
	mov %r9,  %r12 	/* Move output pointer to R12 */
	mov %r8,  %r9	/* Move input 4 to R9 */
	mov %rcx, %r8	/* Move input 3 to R8 */
	mov %rsi, %rcx	/* Move input 1 to RCX */
	/* Leave input param 2 in RDX */


> +	tdcall
> +
> +	/* Check for TDCALL success: 0 - Successful, otherwise failed */
> +	test %rax, %rax
> +	jnz 1f
> +
> +	/* Check for a TDCALL output struct */
> +	test %r12, %r12
> +	jz 1f

Does some universal status come back in r12?  Aren't we dealing with a
VMM/SEAM-controlled register here?  Isn't this dangerous?

> +	/* Copy TDCALL result registers to output struct: */
> +	movq %rcx, TDCALL_rcx(%r12)
> +	movq %rdx, TDCALL_rdx(%r12)
> +	movq %r8,  TDCALL_r8(%r12)
> +	movq %r9,  TDCALL_r9(%r12)
> +	movq %r10, TDCALL_r10(%r12)
> +	movq %r11, TDCALL_r11(%r12)
> +1:
> +	/* Zero out registers exposed to the TDX Module. */
> +	xor %rcx,  %rcx
> +	xor %rdx,  %rdx
> +	xor %r8d,  %r8d
> +	xor %r9d,  %r9d
> +	xor %r10d, %r10d
> +	xor %r11d, %r11d

... why?

> +	/* Restore non-volatile GPRs that are exposed to the VMM. */
> +	pop %r12
> +	pop %r13
> +	pop %r14
> +	pop %r15
> +
> +	FRAME_END
> +	ret
> +SYM_FUNC_END(__tdcall)
> +
> +/*
> + * do_tdvmcall()  - Used to communicate with the VMM.
> + *
> + * @arg1 (RDI)    - TDVMCALL function, e.g. exit reason
> + * @arg2 (RSI)    - Input parameter 1 passed to VMM
> + *                  via register R12
> + * @arg3 (RDX)    - Input parameter 2 passed to VMM
> + *                  via register R13
> + * @arg4 (RCX)    - Input parameter 3 passed to VMM
> + *                  via register R14
> + * @arg5 (R8)     - Input parameter 4 passed to VMM
> + *                  via register R15
> + * @arg6 (R9)     - struct tdvmcall_output pointer
> + *
> + * @out           - Return status of tdvmcall(R10) via RAX.
> + *
> + */

Same comments on the sparse comment style.

> +SYM_CODE_START_LOCAL(do_tdvmcall)
> +	FRAME_BEGIN
> +
> +	/* Save non-volatile GPRs that are exposed to the VMM. */
> +	push %r15
> +	push %r14
> +	push %r13
> +	push %r12
> +
> +	/* Set TDCALL leaf ID to TDVMCALL (0) in RAX */

I think there needs to be some discussion of what TDCALL and TDVMCALL
are.  They are named too similarly not to do so.

> +	xor %eax, %eax
> +	/* Move TDVMCALL function id (1st argument) to R11 */
> +	mov %rdi, %r11
> +	/* Move Input parameter 1-4 to R12-R15 */
> +	mov %rsi, %r12
> +	mov %rdx, %r13
> +	mov %rcx, %r14
> +	mov %r8,  %r15
> +	/* Leave tdvmcall output pointer in R9 */
> +
> +	/*
> +	 * Value of RCX is used by the TDX Module to determine which
> +	 * registers are exposed to VMM. Each bit in RCX represents a
> +	 * register id. You can find the bitmap details from TDX GHCI
> +	 * spec.
> +	 */

This doesn't belong here.  Put it along with the
TDVMCALL_EXPOSE_REGS_MASK, please.

> +	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
> +
> +	tdcall
> +
> +	/*
> +	 * Check for TDCALL success: 0 - Successful, otherwise failed.
> +	 * If failed, there is an issue with TDX Module which is fatal
> +	 * for the guest. So panic.
> +	 */
> +	test %rax, %rax
> +	jnz 2f

So, just to be clear: %RAX is under the control of the SEAM module.  The
VMM has no control over it.  Right?

Shouldn't we say that explicitly?

> +	/* Move TDVMCALL success/failure to RAX to return to user */
> +	mov %r10, %rax
> +
> +	/* Check for TDVMCALL success: 0 - Successful, otherwise failed */
> +	test %rax, %rax
> +	jnz 1f
> +
> +	/* Check for a TDVMCALL output struct */
> +	test %r9, %r9
> +	jz 1f

I'd also include a note that %r9 was neither writable nor its value
exposed to the VMM.

> +	/* Copy TDVMCALL result registers to output struct: */
> +	movq %r11, TDVMCALL_r11(%r9)
> +	movq %r12, TDVMCALL_r12(%r9)
> +	movq %r13, TDVMCALL_r13(%r9)
> +	movq %r14, TDVMCALL_r14(%r9)
> +	movq %r15, TDVMCALL_r15(%r9)
> +1:
> +	/*
> +	 * Zero out registers exposed to the VMM to avoid
> +	 * speculative execution with VMM-controlled values.
> +	 */
> +	xor %r10d, %r10d
> +	xor %r11d, %r11d
> +	xor %r12d, %r12d
> +	xor %r13d, %r13d
> +	xor %r14d, %r14d
> +	xor %r15d, %r15d
> +
> +	/* Restore non-volatile GPRs that are exposed to the VMM. */
> +	pop %r12
> +	pop %r13
> +	pop %r14
> +	pop %r15
> +
> +	FRAME_END
> +	ret
> +2:
> +	ud2
> +SYM_CODE_END(do_tdvmcall)
> +
> +/* Helper function for standard type of TDVMCALL */
> +SYM_FUNC_START(__tdvmcall)
> +	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
> +	xor %r10, %r10
> +	call do_tdvmcall
> +	retq
> +SYM_FUNC_END(__tdvmcall)

Why do we need this helper?  Why does it need to be in assembly?

> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 6a7193fead08..29c52128b9c0 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -1,8 +1,44 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /* Copyright (C) 2020 Intel Corporation */
>  
> +#define pr_fmt(fmt) "TDX: " fmt
> +
>  #include <asm/tdx.h>
>  
> +/*
> + * Wrapper for use case that checks for error code and print warning message.
> + */

This comment isn't very useful.  I can see the error check and warning
by reading the code.

> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
> +{
> +	u64 err;
> +
> +	err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
> +
> +	if (err)
> +		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
> +				    fn, err);
> +
> +	return err;
> +}
> +
> +/*
> + * Wrapper for the semi-common case where we need single output value (R11).
> + */
> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
> +{
> +
> +	struct tdvmcall_output out = {0};
> +	u64 err;
> +
> +	err = __tdvmcall(fn, r12, r13, r14, r15, &out);
> +
> +	if (err)
> +		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
> +				    fn, err);
> +
> +	return out.r11;
> +}

How do callers check for errors?  Is the error value superfluously
returned in r11 and another output register?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option
  2021-04-26 18:01 ` [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option Kuppuswamy Sathyanarayanan
@ 2021-04-26 21:09   ` Randy Dunlap
  2021-04-26 22:32     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Randy Dunlap @ 2021-04-26 21:09 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> Add INTEL_TDX_GUEST config option to selectively compile
> TDX guest support.
> 
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/Kconfig | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 6b4b682af468..932e6d759ba7 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -875,6 +875,21 @@ config ACRN_GUEST
>  	  IOT with small footprint and real-time features. More details can be
>  	  found in https://projectacrn.org/.
>  
> +config INTEL_TDX_GUEST
> +	bool "Intel Trusted Domain eXtensions Guest Support"
> +	depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
> +	depends on SECURITY
> +	select PARAVIRT_XL
> +	select X86_X2APIC
> +	select SECURITY_LOCKDOWN_LSM
> +	help
> +	  Provide support for running in a trusted domain on Intel processors
> +	  equipped with Trusted Domain eXtenstions. TDX is an new Intel

	                                                   a new Intel

> +	  technology that extends VMX and Memory Encryption with a new kind of
> +	  virtual machine guest called Trust Domain (TD). A TD is designed to
> +	  run in a CPU mode that protects the confidentiality of TD memory
> +	  contents and the TD’s CPU state from other software, including VMM.
> +
>  endif #HYPERVISOR_GUEST
>  
>  source "arch/x86/Kconfig.cpu"
> 


-- 
~Randy


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-26 20:32   ` Dave Hansen
@ 2021-04-26 22:31     ` Kuppuswamy, Sathyanarayanan
  2021-04-26 23:17       ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-26 22:31 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 4/26/21 1:32 PM, Dave Hansen wrote:
>> +/*
>> + * Expose registers R10-R15 to VMM (for bitfield info
>> + * refer to TDX GHCI specification).
>> + */
>> +#define TDVMCALL_EXPOSE_REGS_MASK	0xfc00
> 
> Why can't we do:
> 
> #define TDC_R10	BIT(18)
> #define TDC_R11	BIT(19)
> 
> and:
> 
> #define TDVMCALL_EXPOSE_REGS_MASK	(TDX_R10 | TDX_R11 | TDX_R12 ...
> 
> or at least:
> 
> #define TDVMCALL_EXPOSE_REGS_MASK	BIT(18) | BIT(19) ...

If this is the preferred way, I will change it to use macros (TDX_Rxx).

> 
> ?
> 
>> +/*
>> + * TDX guests use the TDCALL instruction to make
>> + * hypercalls to the VMM. It is supported in
>> + * Binutils >= 2.36.
>> + */
>> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
>> +
>> +/*
>> + * __tdcall()  - Used to communicate with the TDX module
> 
> Why is this function here?  What does it do?  Why do we need it?

__tdcall() function is used to request services from the TDX Module.
Example use cases are, TDREPORT, VEINFO, TDINFO, etc.

> 
> I'd like this to actually talk about doing impedance matching between
> the function call and TDCALL ABIs.
> 
>> + * @arg1 (RDI) - TDCALL Leaf ID
>> + * @arg2 (RSI) - Input parameter 1 passed to TDX module
>> + *               via register RCX
>> + * @arg2 (RDX) - Input parameter 2 passed to TDX module
>> + *               via register RDX
>> + * @arg3 (RCX) - Input parameter 3 passed to TDX module
>> + *               via register R8
>> + * @arg4 (R8)  - Input parameter 4 passed to TDX module
>> + *               via register R9
> 
> The unnecessary repetition and verbosity actually make this harder to
> read.  This looks like it was easy to write, but not much effort is
> being made to make it easy to consume.  Could you please apply some
> consideration to making it more readable?
> 
> 
>> + * @arg5 (R9)  - struct tdcall_output pointer
>> + *
>> + * @out        - Return status of tdcall via RAX.
> 
> Don't comments usually just say "returns ... foo"?  Also, the @params
> usually refer to *REAL* variable names.  Where the heck does "out" come
> from?  Why are you even putting argX?  Shouldn't these be @'s be their
> literal function argument names?

I have added this comment block to make it easier to understand the register
mapping between the function arguments and the TDCALL ABI. But I got your
point: usage of @arg1 or @out does not comply with the function comment
standards. I will fix this in the next version.

> 
> 	@rdi - Input parameter, moved to RCX

I will use the above format to document function arguments.

> 
>> + * NOTE: This function should only used for non TDVMCALL
>> + *       use cases
>> + */
>> +SYM_FUNC_START(__tdcall)
>> +	FRAME_BEGIN
>> +
>> +	/* Save non-volatile GPRs that are exposed to the VMM. */
>> +	push %r15
>> +	push %r14
>> +	push %r13
>> +	push %r12
> 
> Why do we have to save these?  Because they might be clobbered?  If so,
> let's say *THAT* instead of just "exposed".  "Exposed" could mean "VMM
> can read".
> 
> Also, this just told me that this function can't be used to talk to the
> VMM.  Why is this talking about exposure to the VMM?

Although __tdcall() is only used to communicate with the TDX module, and the
TDX module is not supposed to touch these registers, I save R12-R15 just to
be on the safe side. The extra push/pop instructions are cheap compared to
the cost of the TDCALL itself.


> 
>> +	/* Move TDCALL Leaf ID to RAX */
>> +	mov %rdi, %rax
>> +	/* Move output pointer to R12 */
>> +	mov %r9, %r12
> 
> I thought 'struct tdcall_output' was a purely software construct.  Why
> are we passing a pointer to it into TDCALL?

It's used to store the TDCALL result (RCX, RDX, R8-R11). As far as this
function is concerned, it's just a block of memory (accessed using
base address + TDCALL_r* offsets).

> 
>> +	/* Move input param 4 to R9 */
>> +	mov %r8, %r9
>> +	/* Move input param 3 to R8 */
>> +	mov %rcx, %r8
>> +	/* Leave input param 2 in RDX */
>> +	/* Move input param 1 to RCX */
>> +	mov %rsi, %rcx
> 
> With a little work, this can be made a *LOT* more readable:
> 
> 	/* Mangle function call ABI into TDCALL ABI: */
> 	mov %rdi, %rax	/* Move TDCALL Leaf ID to RAX */
> 	mov %r9,  %r12 	/* Move output pointer to R12 */
> 	mov %r8,  %r9	/* Move input 4 to R9 */
> 	mov %rcx, %r8	/* Move input 3 to R8 */
> 	mov %rsi, %rcx	/* Move input 1 to RCX */
> 	/* Leave input param 2 in RDX */

Ok. I will use your version.

> 
> 
>> +	tdcall
>> +
>> +	/* Check for TDCALL success: 0 - Successful, otherwise failed */
>> +	test %rax, %rax
>> +	jnz 1f
>> +
>> +	/* Check for a TDCALL output struct */
>> +	test %r12, %r12
>> +	jz 1f
> 
> Does some universal status come back in r12?  Aren't we dealing with a
> VMM/SEAM-controlled register here?  Isn't this dangerous?

R12 is a temporary register we use to hold the address of the caller-passed
output pointer. We just check it for NULL here. R12 will not be used by the
TDX module.

If you prefer, we can just push the output pointer to stack and get it
after we make the tdcall.

> 
>> +	/* Copy TDCALL result registers to output struct: */
>> +	movq %rcx, TDCALL_rcx(%r12)
>> +	movq %rdx, TDCALL_rdx(%r12)
>> +	movq %r8,  TDCALL_r8(%r12)
>> +	movq %r9,  TDCALL_r9(%r12)
>> +	movq %r10, TDCALL_r10(%r12)
>> +	movq %r11, TDCALL_r11(%r12)
>> +1:
>> +	/* Zero out registers exposed to the TDX Module. */
>> +	xor %rcx,  %rcx
>> +	xor %rdx,  %rdx
>> +	xor %r8d,  %r8d
>> +	xor %r9d,  %r9d
>> +	xor %r10d, %r10d
>> +	xor %r11d, %r11d
> 
> ... why?

These registers are used by the TDX Module. Why pass the stale values
back to the user? So we clear them here.

> 
>> +	/* Restore non-volatile GPRs that are exposed to the VMM. */
>> +	pop %r12
>> +	pop %r13
>> +	pop %r14
>> +	pop %r15
>> +
>> +	FRAME_END
>> +	ret
>> +SYM_FUNC_END(__tdcall)
>> +
>> +/*
>> + * do_tdvmcall()  - Used to communicate with the VMM.
>> + *
>> + * @arg1 (RDI)    - TDVMCALL function, e.g. exit reason
>> + * @arg2 (RSI)    - Input parameter 1 passed to VMM
>> + *                  via register R12
>> + * @arg3 (RDX)    - Input parameter 2 passed to VMM
>> + *                  via register R13
>> + * @arg4 (RCX)    - Input parameter 3 passed to VMM
>> + *                  via register R14
>> + * @arg5 (R8)     - Input parameter 4 passed to VMM
>> + *                  via register R15
>> + * @arg6 (R9)     - struct tdvmcall_output pointer
>> + *
>> + * @out           - Return status of tdvmcall(R10) via RAX.
>> + *
>> + */
> 
> Same comments on the sparse comment style.

Will fix it, similar to __tdcall().

> 
>> +SYM_CODE_START_LOCAL(do_tdvmcall)
>> +	FRAME_BEGIN
>> +
>> +	/* Save non-volatile GPRs that are exposed to the VMM. */
>> +	push %r15
>> +	push %r14
>> +	push %r13
>> +	push %r12
>> +
>> +	/* Set TDCALL leaf ID to TDVMCALL (0) in RAX */
> 
> I think there needs to be some discussion of what TDCALL and TDVMCALL
> are.  They are named too similarly not to do so.

TDVMCALL is a sub-function of TDCALL (selected by setting the RAX register
to 0). TDVMCALL is used to request services from the VMM.

> 
>> +	xor %eax, %eax
>> +	/* Move TDVMCALL function id (1st argument) to R11 */
>> +	mov %rdi, %r11
>> +	/* Move Input parameter 1-4 to R12-R15 */
>> +	mov %rsi, %r12
>> +	mov %rdx, %r13
>> +	mov %rcx, %r14
>> +	mov %r8,  %r15
>> +	/* Leave tdvmcall output pointer in R9 */
>> +
>> +	/*
>> +	 * Value of RCX is used by the TDX Module to determine which
>> +	 * registers are exposed to VMM. Each bit in RCX represents a
>> +	 * register id. You can find the bitmap details from TDX GHCI
>> +	 * spec.
>> +	 */
> 
> This doesn't belong here.  Put it along with the
> TDVMCALL_EXPOSE_REGS_MASK, please.

Ok. I will do it.

> 
>> +	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>> +
>> +	tdcall
>> +
>> +	/*
>> +	 * Check for TDCALL success: 0 - Successful, otherwise failed.
>> +	 * If failed, there is an issue with TDX Module which is fatal
>> +	 * for the guest. So panic.
>> +	 */
>> +	test %rax, %rax
>> +	jnz 2f
> 
> So, just to be clear: %RAX is under the control of the SEAM module.  The
> VMM has no control over it.  Right?

AFAIK, the VMM will not touch it.

Sean, please confirm it.

> 
> Shouldn't we say that explicitly?

I can add it to the above comment.

> 
>> +	/* Move TDVMCALL success/failure to RAX to return to user */
>> +	mov %r10, %rax
>> +
>> +	/* Check for TDVMCALL success: 0 - Successful, otherwise failed */
>> +	test %rax, %rax
>> +	jnz 1f
>> +
>> +	/* Check for a TDVMCALL output struct */
>> +	test %r9, %r9
>> +	jz 1f
> 
> I'd also include a note that %r9 was neither writable nor its value
> exposed to the VMM.

will do it.

> 
>> +	/* Copy TDVMCALL result registers to output struct: */
>> +	movq %r11, TDVMCALL_r11(%r9)
>> +	movq %r12, TDVMCALL_r12(%r9)
>> +	movq %r13, TDVMCALL_r13(%r9)
>> +	movq %r14, TDVMCALL_r14(%r9)
>> +	movq %r15, TDVMCALL_r15(%r9)
>> +1:
>> +	/*
>> +	 * Zero out registers exposed to the VMM to avoid
>> +	 * speculative execution with VMM-controlled values.
>> +	 */
>> +	xor %r10d, %r10d
>> +	xor %r11d, %r11d
>> +	xor %r12d, %r12d
>> +	xor %r13d, %r13d
>> +	xor %r14d, %r14d
>> +	xor %r15d, %r15d
>> +
>> +	/* Restore non-volatile GPRs that are exposed to the VMM. */
>> +	pop %r12
>> +	pop %r13
>> +	pop %r14
>> +	pop %r15
>> +
>> +	FRAME_END
>> +	ret
>> +2:
>> +	ud2
>> +SYM_CODE_END(do_tdvmcall)
>> +
>> +/* Helper function for standard type of TDVMCALL */
>> +SYM_FUNC_START(__tdvmcall)
>> +	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
>> +	xor %r10, %r10
>> +	call do_tdvmcall
>> +	retq
>> +SYM_FUNC_END(__tdvmcall)
> 
> Why do we need this helper?  Why does it need to be in assembly?

It's simpler to do it in assembly. Also, grouping all the register updates
in the same file makes it easier to read and debug. Another reason is that
do_tdvmcall() is also called from the in/out instruction use case.

> 
>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>> index 6a7193fead08..29c52128b9c0 100644
>> --- a/arch/x86/kernel/tdx.c
>> +++ b/arch/x86/kernel/tdx.c
>> @@ -1,8 +1,44 @@
>>   // SPDX-License-Identifier: GPL-2.0
>>   /* Copyright (C) 2020 Intel Corporation */
>>   
>> +#define pr_fmt(fmt) "TDX: " fmt
>> +
>>   #include <asm/tdx.h>
>>   
>> +/*
>> + * Wrapper for use case that checks for error code and print warning message.
>> + */
> 
> This comment isn't very useful.  I can see the error check and warning
> by reading the code.

It's just a helper that covers the common case of checking for an error and
printing a warning message. If this comment is superfluous, I can remove it.

> 
>> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>> +{
>> +	u64 err;
>> +
>> +	err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
>> +
>> +	if (err)
>> +		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>> +				    fn, err);
>> +
>> +	return err;
>> +}
>> +
>> +/*
>> + * Wrapper for the semi-common case where we need single output value (R11).
>> + */
>> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>> +{
>> +
>> +	struct tdvmcall_output out = {0};
>> +	u64 err;
>> +
>> +	err = __tdvmcall(fn, r12, r13, r14, r15, &out);
>> +
>> +	if (err)
>> +		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>> +				    fn, err);
>> +
>> +	return out.r11;
>> +}
> 
> How do callers check for errors?  Is the error value superfluously
> returned in r11 and another output register?

The error is already checked in this helper function. Callers of this
function only care about the output value (R11), mainly for the in/out use
case.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option
  2021-04-26 21:09   ` Randy Dunlap
@ 2021-04-26 22:32     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-26 22:32 UTC (permalink / raw)
  To: Randy Dunlap, Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 4/26/21 2:09 PM, Randy Dunlap wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>> Add INTEL_TDX_GUEST config option to selectively compile
>> TDX guest support.
>>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>> Reviewed-by: Tony Luck <tony.luck@intel.com>
>> ---
>>   arch/x86/Kconfig | 15 +++++++++++++++
>>   1 file changed, 15 insertions(+)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 6b4b682af468..932e6d759ba7 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -875,6 +875,21 @@ config ACRN_GUEST
>>   	  IOT with small footprint and real-time features. More details can be
>>   	  found in https://projectacrn.org/.
>>   
>> +config INTEL_TDX_GUEST
>> +	bool "Intel Trusted Domain eXtensions Guest Support"
>> +	depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
>> +	depends on SECURITY
>> +	select PARAVIRT_XL
>> +	select X86_X2APIC
>> +	select SECURITY_LOCKDOWN_LSM
>> +	help
>> +	  Provide support for running in a trusted domain on Intel processors
>> +	  equipped with Trusted Domain eXtenstions. TDX is an new Intel
> 
> 	                                                   a new Intel
> 

Good catch. I will fix it in the next version.

>> +	  technology that extends VMX and Memory Encryption with a new kind of
>> +	  virtual machine guest called Trust Domain (TD). A TD is designed to
>> +	  run in a CPU mode that protects the confidentiality of TD memory
>> +	  contents and the TD’s CPU state from other software, including VMM.
>> +
>>   endif #HYPERVISOR_GUEST
>>   
>>   source "arch/x86/Kconfig.cpu"
>>
> 
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-26 22:31     ` Kuppuswamy, Sathyanarayanan
@ 2021-04-26 23:17       ` Dave Hansen
  2021-04-27  2:29         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-04-26 23:17 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 3:31 PM, Kuppuswamy, Sathyanarayanan wrote:
>>> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
>>> +
>>> +/*
>>> + * __tdcall()  - Used to communicate with the TDX module
>>
>> Why is this function here?  What does it do?  Why do we need it?
> 
> __tdcall() function is used to request services from the TDX Module.
> Example use cases are, TDREPORT, VEINFO, TDINFO, etc.

I think there might be some misinterpretation of my question.  What you
are describing is what *TDCALL* does.  Why do we need a wrapper
function?  What purpose does this wrapper function serve?  Why do we
need this wrapper function?

>>> + * NOTE: This function should only used for non TDVMCALL
>>> + *       use cases
>>> + */
>>> +SYM_FUNC_START(__tdcall)
>>> +    FRAME_BEGIN
>>> +
>>> +    /* Save non-volatile GPRs that are exposed to the VMM. */
>>> +    push %r15
>>> +    push %r14
>>> +    push %r13
>>> +    push %r12
>>
>> Why do we have to save these?  Because they might be clobbered?  If so,
>> let's say *THAT* instead of just "exposed".  "Exposed" could mean "VMM
>> can read".
>>
>> Also, this just told me that this function can't be used to talk to the
>> VMM.  Why is this talking about exposure to the VMM?
> 
> Although __tdcall() is only used to communicate with the TDX module, and the
> TDX module is not supposed to touch these registers, I save R12-R15 just to
> be on the safe side. The extra push/pop instructions are cheap compared to
> the cost of the TDCALL itself.

Why are you talking about the VMM if this is a call to the SEAM module?

Let's say someone is reading the TDCALL architecture spec.  It will say
something like, "blah blah, in this case TDCALL will not modify
%r12->%r15".  Then someone goes and looks at this code that basically
says (or implies) "save these before the SEAM module modifies them".
What is a coder to do?

Please remove the ambiguity, either by removing this superfluous
(according to the spec) code, or documenting why it is not superfluous.

>>> +    /* Move TDCALL Leaf ID to RAX */
>>> +    mov %rdi, %rax
>>> +    /* Move output pointer to R12 */
>>> +    mov %r9, %r12
>>
>> I thought 'struct tdcall_output' was a purely software construct.  Why
>> are we passing a pointer to it into TDCALL?
> 
> It's used to store the TDCALL result (RCX, RDX, R8-R11). As far as this
> function is concerned, it's just a block of memory (accessed using
> base address + TDCALL_r* offsets).

Is 'struct tdcall_output' a hardware architectural structure or a
software structure?

If it's a software structure, then why are we passing a pointer to a
software structure into a hardware ABI?

If it's a hardware architecture structure, where is the documentation
for it?

>>> +    tdcall
>>> +
>>> +    /* Check for TDCALL success: 0 - Successful, otherwise failed */
>>> +    test %rax, %rax
>>> +    jnz 1f
>>> +
>>> +    /* Check for a TDCALL output struct */
>>> +    test %r12, %r12
>>> +    jz 1f
>>
>> Does some universal status come back in r12?  Aren't we dealing with a
>> VMM/SEAM-controlled register here?  Isn't this dangerous?
> 
> R12 is a temporary register we use to hold the address of the caller-passed
> output pointer. We just check it for NULL here. R12 will not be used by the
> TDX module.

OK, so how do you know this?  Could you share your logic, please?

> If you prefer, we can just push the output pointer to stack and get it
> after we make the tdcall.

I prefer that the code be understandable and be written for a clear
purpose.  If you're using r12 for temporary storage, I expect to see at
least one reference *SOMEWHERE* to its use as temporary storage.  Right
now.... nothing.

>>> +    /* Copy TDCALL result registers to output struct: */
>>> +    movq %rcx, TDCALL_rcx(%r12)
>>> +    movq %rdx, TDCALL_rdx(%r12)
>>> +    movq %r8,  TDCALL_r8(%r12)
>>> +    movq %r9,  TDCALL_r9(%r12)
>>> +    movq %r10, TDCALL_r10(%r12)
>>> +    movq %r11, TDCALL_r11(%r12)
>>> +1:
>>> +    /* Zero out registers exposed to the TDX Module. */
>>> +    xor %rcx,  %rcx
>>> +    xor %rdx,  %rdx
>>> +    xor %r8d,  %r8d
>>> +    xor %r9d,  %r9d
>>> +    xor %r10d, %r10d
>>> +    xor %r11d, %r11d
>>
>> ... why?
> 
> These registers are used by the TDX Module. Why pass the stale values
> back to the user? So we clear them here.

Please go look at some other assembly code in the kernel called from C.
Do those functions do this?  Why?  Why not?  Do they care about
"passing stale values back up"?

>>> +SYM_CODE_START_LOCAL(do_tdvmcall)
>>> +    FRAME_BEGIN
>>> +
>>> +    /* Save non-volatile GPRs that are exposed to the VMM. */
>>> +    push %r15
>>> +    push %r14
>>> +    push %r13
>>> +    push %r12
>>> +
>>> +    /* Set TDCALL leaf ID to TDVMCALL (0) in RAX */
>>
>> I think there needs to be some discussion of what TDCALL and TDVMCALL
>> are.  They are named too similarly not to do so.
> 
> TDVMCALL is a sub-function of TDCALL (selected by setting the RAX register
> to 0). TDVMCALL is used to request services from the VMM.

Actually, I think these functions are horribly misnamed.

I think we should make them

	__tdx_seam_call()
or	__tdx_module_call()

and

	__tdx_hypercall()


	__tdcall()
and
	__tdvmcall()

are really nonsensical in this context, especially since TDVMCALL is
implemented with the TDCALL instruction, but not the __tdcall() function.

>>> +/* Helper function for standard type of TDVMCALL */
>>> +SYM_FUNC_START(__tdvmcall)
>>> +    /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
>>> +    xor %r10, %r10
>>> +    call do_tdvmcall
>>> +    retq
>>> +SYM_FUNC_END(__tdvmcall)
>>
>> Why do we need this helper?  Why does it need to be in assembly?
> 
> It's simpler to do it in assembly. Also, grouping all the register updates
> in the same file makes it easier to read and debug. Another reason is that
> do_tdvmcall() is also called from the in/out instruction use case.

Sathya, I seem to have to reverse-engineer what you are doing for all
this stuff.  Your answers to my questions are almost entirely orthogonal
to the things I really want to know.  I guess I need to be more precise
with the questions I'm asking.  But, this is yet another case where I
think the burden for this series continues to fall on the reviewer
rather than the submitter.  Not the way I think it is best.

So, trying to reverse-engineer what you are doing here... it seems that
you can't *practically* call do_tdvmcall() directly because %r10 would
be garbage.  That makes this (or a wrapper like it) required for every
practical call to do_tdvmcall().

But, even if that's the case, you need to *DOCUMENT* that up in
do_tdvmcall(): Hey, this function is worthless without something that
sets up %r10 before calling it.

I'm also not *SURE* this is simpler to do in assembly.

>>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>>> index 6a7193fead08..29c52128b9c0 100644
>>> --- a/arch/x86/kernel/tdx.c
>>> +++ b/arch/x86/kernel/tdx.c
>>> @@ -1,8 +1,44 @@
>>>   // SPDX-License-Identifier: GPL-2.0
>>>   /* Copyright (C) 2020 Intel Corporation */
>>>   +#define pr_fmt(fmt) "TDX: " fmt
>>> +
>>>   #include <asm/tdx.h>
>>>   +/*
>>> + * Wrapper for use case that checks for error code and print warning
>>> message.
>>> + */
>>
>> This comment isn't very useful.  I can see the error check and warning
>> by reading the code.
> 
> Its just a helper function that covers common case of checking for error
> and print the warning message. If this comment is superfluous, I can remove
> it.

I'd prefer that you actually write a comment about what the function is
doing, maybe:

/*
 * Wrapper for simple hypercalls that only return a success/error code.
 */

... or *SOMETHING* that tells what its purpose in life is.

>>> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>>> +{
>>> +    u64 err;
>>> +
>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
>>> +
>>> +    if (err)
>>> +        pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>>> +                    fn, err);
>>> +
>>> +    return err;
>>> +}
>>> +
>>> +/*
>>> + * Wrapper for the semi-common case where we need single output
>>> value (R11).
>>> + */
>>> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64
>>> r14, u64 r15)
>>> +{
>>> +
>>> +    struct tdvmcall_output out = {0};
>>> +    u64 err;
>>> +
>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, &out);
>>> +
>>> +    if (err)
>>> +        pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>>> +                    fn, err);
>>> +
>>> +    return out.r11;
>>> +}
>>
>> How do callers check for errors?  Is the error value superfluously
>> returned in r11 and another output register?
> 
> We already check for error in this helper function. User of this function
> only cares about output value (R11). Mainly for in/out use case.

That's pretty valuable information.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-26 23:17       ` Dave Hansen
@ 2021-04-27  2:29         ` Kuppuswamy, Sathyanarayanan
  2021-04-27 14:29           ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-27  2:29 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 4/26/21 4:17 PM, Dave Hansen wrote:
> On 4/26/21 3:31 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
>>>> +
>>>> +/*
>>>> + * __tdcall()  - Used to communicate with the TDX module
>>>
>>> Why is this function here?  What does it do?  Why do we need it?
>>
>> __tdcall() function is used to request services from the TDX Module.
>> Example use cases are, TDREPORT, VEINFO, TDINFO, etc.
> 
> I think there might be some misinterpretation of my question.  What you
> are describing is what *TDCALL* does.  Why do we need a wrapper
> function?  What purpose does this wrapper function serve?  Why do we
> need this wrapper function?
> 

How about following explanation?

Helper function for "tdcall" instruction, which can be used to request
services from the TDX module (does not include VMM). Few examples of
valid TDX module services are, "TDREPORT", "MEM PAGE ACCEPT", "VEINFO",
etc.

This function serves as a wrapper to move user call arguments to
the correct registers as specified by "tdcall" ABI and shares it with
the TDX module.  If the "tdcall" operation is successful and a
valid "struct tdcall_out" pointer is available (in "out" argument),
output from the TDX module (RCX, RDX, R8-R11) is saved to the memory
specified in the "out" pointer. Also the status of the "tdcall"
operation is returned back to the user as a function return value.

>>> Why do we have to save these?  Because they might be clobbered?  If so,
>>> let's say *THAT* instead of just "exposed".  "Exposed" could mean "VMM
>>> can read".
>>>
>>> Also, this just told me that this function can't be used to talk to the
>>> VMM.  Why is this talking about exposure to the VMM?
>>
>> Although __tdcall() is only used to communicate with the TDX module and the
>> TDX module is not supposed to touch these registers, just to be on the safe
>> side, I have tried to save the context of registers R12-R15. Anyway cycles
>> used by instructions are less compared to tdcall.
> 
> Why are you talking about the VMM if this is a call to the SEAM module?
> 
> Let's say someone is reading the TDCALL architecture spec.  It will say
> something like, "blah blah, in this case TDCALL will not modify
> %r12->%r15".  Then someone goes and looks at this code that basically
> says (or implies) "save these before the SEAM module modifies them".
> What is a coder to do?
> 
> Please remove the ambiguity, either by removing this superfluous
> (according to the spec) code, or documenting why it is not superfluous.

Agree. I will remove the save/restore context code.

> 
>>>> +    /* Move TDCALL Leaf ID to RAX */
>>>> +    mov %rdi, %rax
>>>> +    /* Move output pointer to R12 */
>>>> +    mov %r9, %r12
>>>
>>> I thought 'struct tdcall_output' was a purely software construct.  Why
>>> are we passing a pointer to it into TDCALL?
>>
>> Its used to store the TDCALL result (RCX, RDX, R8-R11). As far as this
>> function is concerned, its just a block of memory (accessed using
>> base address + TDCALL_r* offsets).
> 
> Is 'struct tdcall_output' a hardware architectural structure or a
> software structure?
> 
> If it's a software structure, then why are we passing a pointer to a
> software structure into a hardware ABI?
> 
> If it's a hardware architecture structure, where is the documentation
> for it?
> 

I think there is a misunderstanding here. We don't share the tdcall_output
pointer with the TDX module. Current use cases of TDCALL (other than TDVMCALL)
do not use registers from R12-R15. Since the registers R12-R15 are free and
available, we are using R12 as temporary storage to hold the tdcall_output
pointer.

I will include some comment about using it as temporary storage.


> 
> I prefer that the code be understandable and be written for a clear
> purpose.  If you're using r12 for temporary storage, I expect to see at
> least one reference *SOMEWHERE* to its use as temporary storage.  Right
> now.... nothing.
> 

I will include some reference to it.

>>>> +    /* Copy TDCALL result registers to output struct: */
>>>> +    movq %rcx, TDCALL_rcx(%r12)
>>>> +    movq %rdx, TDCALL_rdx(%r12)
>>>> +    movq %r8,  TDCALL_r8(%r12)
>>>> +    movq %r9,  TDCALL_r9(%r12)
>>>> +    movq %r10, TDCALL_r10(%r12)
>>>> +    movq %r11, TDCALL_r11(%r12)
>>>> +1:
>>>> +    /* Zero out registers exposed to the TDX Module. */
>>>> +    xor %rcx,  %rcx
>>>> +    xor %rdx,  %rdx
>>>> +    xor %r8d,  %r8d
>>>> +    xor %r9d,  %r9d
>>>> +    xor %r10d, %r10d
>>>> +    xor %r11d, %r11d
>>>
>>> ... why?
>>
>> These registers are used by the TDX Module. Why pass the stale values
>> back to the user? So we clear them here.
> 
> Please go look at some other assembly code in the kernel called from C.
>   Do those functions do this?  Why?  Why not?  Do they care about
> "passing stale values back up"?
> 

Maybe I am being overly cautious here. Since TDX module is the trusted
code, speculation attack is not a consideration here. I will remove this
block of code.

>>>> +SYM_CODE_START_LOCAL(do_tdvmcall)
>>>> +    FRAME_BEGIN
>>>> +
>>>> +    /* Save non-volatile GPRs that are exposed to the VMM. */
>>>> +    push %r15
>>>> +    push %r14
>>>> +    push %r13
>>>> +    push %r12
>>>> +
>>>> +    /* Set TDCALL leaf ID to TDVMCALL (0) in RAX */
>>>
>>> I think there needs to be some discussion of what TDCALL and TDVMCALL
>>> are.  They are named too similarly not to do so.
>>
>> TDVMCALL is the sub function of TDCALL (selected by setting RAX register
>> to 0). TDVMCALL is used to request services from VMM.
> 
> Actually, I think these functions are horribly misnamed.
> 
> I think we should make them
> 
> 	__tdx_seam_call()
> or	__tdx_module_call()
> 
> and
> 
> 	__tdx_hypercall()
> 
> 
> 	__tdcall()
> and
> 	__tdvmcall()
> 
> are really nonsensical in this context, especially since TDVMCALL is
> implemented with the TDCALL instruction, but not the __tdcall() function.
> 

TDVMCALL is a short form of "TDG.VP.VMCALL". The term comes from the
GHCI document and can be read as "Trusted Domain VMCALL". Maybe
because we are used to the GHCI spec, we don't find it confusing. I agree
that, considering how the "tdcall" instruction is used, it is confusing.

But if it's confusing for new readers and rename is preferred,

Do we need to rename the helper functions ?

tdvmcall(), tdvmcall_out_r11()

Also what about output structs?

struct tdcall_output
struct tdvmcall_output

>>>> +/* Helper function for standard type of TDVMCALL */
>>>> +SYM_FUNC_START(__tdvmcall)
>>>> +    /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
>>>> +    xor %r10, %r10
>>>> +    call do_tdvmcall
>>>> +    retq
>>>> +SYM_FUNC_END(__tdvmcall)
>>>
>>> Why do we need this helper?  Why does it need to be in assembly?
>>
>> Its simpler to do it in assembly. Also, grouping all register updates
>> in the same file will make it easier for us to read or debug issues.
>> Another
>> reason is, we also call do_tdvmcall() from in/out instruction use case.
> 
> Sathya, I seem to have to reverse-engineer what you are doing for all
> this stuff.  Your answers to my questions are almost entirely orthogonal
> to the things I really want to know.  I guess I need to be more precise
> with the questions I'm asking.  But, this is yet another case where I
> think the burden for this series continues to fall on the reviewer
> rather than the submitter.  Not the way I think it is best.

I had assumed that you were aware of the reason for the existence of
the do_tdvmcall() helper function. It mainly exists to hold the code
that is common between vendor-specific and standard TDVMCALLs.

That was a mistake on my end. I will try to be more thorough in my
future replies.

> 
> So, trying to reverse-engineer what you are doing here... it seems that
> you can't *practically* call do_tdvmcall() directly because %r10 would
> be garbage.  That makes this (or a wrapper like it) required for every
> practical call to do_tdvmcall().
> 
> But, even if that's the case, you need to *DOCUMENT* that up in
> do_tdvmcall(): Hey, this function is worthless without something that
> sets up %r10 before calling it.

Agree. This needs to be documented. I will add it in next version.

> 
> I'm also not *SURE* this is simpler to do in assembly.
> 
>>>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>>>> index 6a7193fead08..29c52128b9c0 100644
>>>> --- a/arch/x86/kernel/tdx.c
>>>> +++ b/arch/x86/kernel/tdx.c
>>>> @@ -1,8 +1,44 @@
>>>>    // SPDX-License-Identifier: GPL-2.0
>>>>    /* Copyright (C) 2020 Intel Corporation */
>>>>    +#define pr_fmt(fmt) "TDX: " fmt
>>>> +
>>>>    #include <asm/tdx.h>
>>>>    +/*
>>>> + * Wrapper for use case that checks for error code and print warning
>>>> message.
>>>> + */
>>>
>>> This comment isn't very useful.  I can see the error check and warning
>>> by reading the code.
>>
>> Its just a helper function that covers common case of checking for error
>> and print the warning message. If this comment is superfluous, I can remove
>> it.
> 
> I'd prefer that you actually write a comment about what the function is
> doing, maybe:
> 
> /*
>   * Wrapper for simple hypercalls that only return a success/error code.
>   */
> 
> ... or *SOMETHING* that tells what its purpose in life is.

I will fix it in next version.

> 
>>>> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>>>> +{
>>>> +    u64 err;
>>>> +
>>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
>>>> +
>>>> +    if (err)
>>>> +        pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>>>> +                    fn, err);
>>>> +
>>>> +    return err;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Wrapper for the semi-common case where we need single output
>>>> value (R11).
>>>> + */
>>>> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64
>>>> r14, u64 r15)
>>>> +{
>>>> +
>>>> +    struct tdvmcall_output out = {0};
>>>> +    u64 err;
>>>> +
>>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, &out);
>>>> +
>>>> +    if (err)
>>>> +        pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>>>> +                    fn, err);
>>>> +
>>>> +    return out.r11;
>>>> +}
>>>
>>> How do callers check for errors?  Is the error value superfluously
>>> returned in r11 and another output register?
>>
>> We already check for error in this helper function. User of this function
>> only cares about output value (R11). Mainly for in/out use case.
> 
> That's pretty valuable information.

I will include this note in the function comment.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-27  2:29         ` Kuppuswamy, Sathyanarayanan
@ 2021-04-27 14:29           ` Dave Hansen
  2021-04-27 19:18             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-04-27 14:29 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 7:29 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 4/26/21 4:17 PM, Dave Hansen wrote:
>> On 4/26/21 3:31 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>>> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
>>>>> +
>>>>> +/*
>>>>> + * __tdcall()  - Used to communicate with the TDX module
>>>>
>>>> Why is this function here?  What does it do?  Why do we need it?
>>>
>>> __tdcall() function is used to request services from the TDX Module.
>>> Example use cases are, TDREPORT, VEINFO, TDINFO, etc.
>>
>> I think there might be some misinterpretation of my question.  What you
>> are describing is what *TDCALL* does.  Why do we need a wrapper
>> function?  What purpose does this wrapper function serve?  Why do we
>> need this wrapper function?
>>
> How about following explanation?
> 
> Helper function for "tdcall" instruction, which can be used to request
> services from the TDX module (does not include VMM). Few examples of
> valid TDX module services are, "TDREPORT", "MEM PAGE ACCEPT", "VEINFO",
> etc.

Naming the services here is not useful.  If I want to know who calls
this, I'll just literally do that: look up the callers of this function.

> This function serves as a wrapper to move user call arguments to
> the correct registers as specified by "tdcall" ABI and shares it with
> the TDX module.  If the "tdcall" operation is successful and a
> valid "struct tdcall_out" pointer is available (in "out" argument),
> output from the TDX module (RCX, RDX, R8-R11) is saved to the memory
> specified in the "out" pointer. Also the status of the "tdcall"
> operation is returned back to the user as a function return value.

I tend to prefer function comments that talk high-level about what the
function does rather than waste space on the exact registers used in the
ABI.  I also tend not to talk about things that can be trivially grepped
for, like the callers of this function.

I'd trim the fat out of there, but it's generally OK, although too
rotund for my taste.

>>>>> +    /* Move TDCALL Leaf ID to RAX */
>>>>> +    mov %rdi, %rax
>>>>> +    /* Move output pointer to R12 */
>>>>> +    mov %r9, %r12
>>>>
>>>> I thought 'struct tdcall_output' was a purely software construct.  Why
>>>> are we passing a pointer to it into TDCALL?
>>>
>>> Its used to store the TDCALL result (RCX, RDX, R8-R11). As far as this
>>> function is concerned, its just a block of memory (accessed using
>>> base address + TDCALL_r* offsets).
>>
>> Is 'struct tdcall_output' a hardware architectural structure or a
>> software structure?
>>
>> If it's a software structure, then why are we passing a pointer to a
>> software structure into a hardware ABI?
>>
>> If it's a hardware architecture structure, where is the documentation
>> for it?
>>
> 
> I think there is a misunderstanding here. We don't share the tdcall_output
> pointer with the TDX module. Current use cases of TDCALL (other than
> TDVMCALL)
> do not use registers from R12-R15. Since the registers R12-R15 are free and
> available, we are using R12 as temporary storage to hold the tdcall_output
> pointer.

In other words, 'struct tdcall_output' is a purely software concept.
However, its pointer is manipulated literally next to all of the TDCALL
register arguments and it has an *IDENTICAL* comment to all of those
other moves.

Please make it clear that %r12 is not being used at all for the TDCALL
instruction itself.

But, the bigger point here is that the code needs to be structured in a
way that makes the function and interactions clear.  If I want to know
more about the "output pointer", where do I go?  Do I go looking at the
calling functions or the TDINFO instruction reference?

...
>> Please go look at some other assembly code in the kernel called from C.
>>   Do those functions do this?  Why?  Why not?  Do they care about
>> "passing stale values back up"?
> 
> Maybe I am being overly cautious here. Since TDX module is the trusted
> code, speculation attack is not a consideration here. I will remove this
> block of code.

Caution is OK.  Caution without explanation somewhere as to why it is
warranted is not.

> Do we need to rename the helper functions ?
> 
> tdvmcall(), tdvmcall_out_r11()

Yes.

> Also what about output structs?
> 
> struct tdcall_output
> struct tdvmcall_output

Yes, they need sane, straightforward names which are not confusing too.

>>>>> +/* Helper function for standard type of TDVMCALL */
>>>>> +SYM_FUNC_START(__tdvmcall)
>>>>> +    /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
>>>>> +    xor %r10, %r10
>>>>> +    call do_tdvmcall
>>>>> +    retq
>>>>> +SYM_FUNC_END(__tdvmcall)
>>>>
>>>> Why do we need this helper?  Why does it need to be in assembly?
>>>
>>> Its simpler to do it in assembly. Also, grouping all register updates
>>> in the same file will make it easier for us to read or debug issues.
>>> Another
>>> reason is, we also call do_tdvmcall() from in/out instruction use case.
>>
>> Sathya, I seem to have to reverse-engineer what you are doing for all
>> this stuff.  Your answers to my questions are almost entirely orthogonal
>> to the things I really want to know.  I guess I need to be more precise
>> with the questions I'm asking.  But, this is yet another case where I
>> think the burden for this series continues to fall on the reviewer
>> rather than the submitter.  Not the way I think it is best.
> 
> I had assumed that you were aware of the reason for the existence of
> the do_tdvmcall() helper function. It mainly exists to hold the code
> that is common between vendor-specific and standard TDVMCALLs.

No, I was not aware of that.  Remember, you're not doing this for *ME*.
 You're doing it for the hundred other people that are going to look
over the code and who won't have been aware of your reasoning.


* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-04-26 18:01 ` [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL Kuppuswamy Sathyanarayanan
@ 2021-04-27 17:31   ` Borislav Petkov
  2021-05-06 14:59     ` Kirill A. Shutemov
  2021-05-10  8:07     ` Juergen Gross
  0 siblings, 2 replies; 381+ messages in thread
From: Borislav Petkov @ 2021-04-27 17:31 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel, Jürgen Gross

+ Jürgen.

On Mon, Apr 26, 2021 at 11:01:28AM -0700, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Split off halt paravirt calls from CONFIG_PARAVIRT_XXL into
> a separate config option. It provides a middle ground for
> not-so-deep paravirtulized environments.

Please introduce a spellchecker into your patch creation workflow.

Also, what does "not-so-deep" mean?

> CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
> calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
> config would be a bloat for TDX.

Used how? Why is it bloat for TDX?

I'm sure that'll become clear in the remainder of the patches but you
should state it here so that it is clear why you're doing what you're
doing.

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/Kconfig                      |  4 +++
>  arch/x86/boot/compressed/misc.h       |  1 +
>  arch/x86/include/asm/irqflags.h       | 38 +++++++++++++++------------
>  arch/x86/include/asm/paravirt.h       | 22 +++++++++-------
>  arch/x86/include/asm/paravirt_types.h |  3 ++-
>  arch/x86/kernel/paravirt.c            |  4 ++-
>  arch/x86/mm/mem_encrypt_identity.c    |  1 +
>  7 files changed, 44 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2792879d398e..6b4b682af468 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -783,8 +783,12 @@ config PARAVIRT
>  	  over full virtualization.  However, when run without a hypervisor
>  	  the kernel is theoretically slower and slightly larger.
>  
> +config PARAVIRT_XL
> +	bool
> +
>  config PARAVIRT_XXL
>  	bool
> +	select PARAVIRT_XL
>  
>  config PARAVIRT_DEBUG
>  	bool "paravirt-ops debugging"
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index 901ea5ebec22..4b84abe43765 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -9,6 +9,7 @@
>   * paravirt and debugging variants are added.)
>   */
>  #undef CONFIG_PARAVIRT
> +#undef CONFIG_PARAVIRT_XL
>  #undef CONFIG_PARAVIRT_XXL

So what happens if someone else needs even less pv and defines
CONFIG_PARAVIRT_L. Or _M? Or _S?

Are we going to teleport into a clothing store each time we look at
paravirt now? :)

So before this goes out of hand let's define explicitly, pls, what
XXL means and XL. And rename them. They could be called PARAVIRT_FULL
and PARAVIRT_HLT as apparently that thing is exposing only the PV ops
related to HLT.

Or something to that effect.

Dunno, maybe Jürgen has a better idea, leaving in the rest quoted for him.

Thx.

>  #undef CONFIG_PARAVIRT_SPINLOCKS
>  #undef CONFIG_KASAN
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index 144d70ea4393..1688841893d7 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -59,27 +59,11 @@ static inline __cpuidle void native_halt(void)
>  
>  #endif
>  
> -#ifdef CONFIG_PARAVIRT_XXL
> +#ifdef CONFIG_PARAVIRT_XL
>  #include <asm/paravirt.h>
>  #else
>  #ifndef __ASSEMBLY__
>  #include <linux/types.h>
> -
> -static __always_inline unsigned long arch_local_save_flags(void)
> -{
> -	return native_save_fl();
> -}
> -
> -static __always_inline void arch_local_irq_disable(void)
> -{
> -	native_irq_disable();
> -}
> -
> -static __always_inline void arch_local_irq_enable(void)
> -{
> -	native_irq_enable();
> -}
> -
>  /*
>   * Used in the idle loop; sti takes one instruction cycle
>   * to complete:
> @@ -97,6 +81,26 @@ static inline __cpuidle void halt(void)
>  {
>  	native_halt();
>  }
> +#endif /* !__ASSEMBLY__ */
> +#endif /* CONFIG_PARAVIRT_XL */
> +
> +#ifndef CONFIG_PARAVIRT_XXL
> +#ifndef __ASSEMBLY__
> +
> +static __always_inline unsigned long arch_local_save_flags(void)
> +{
> +	return native_save_fl();
> +}
> +
> +static __always_inline void arch_local_irq_disable(void)
> +{
> +	native_irq_disable();
> +}
> +
> +static __always_inline void arch_local_irq_enable(void)
> +{
> +	native_irq_enable();
> +}
>  
>  /*
>   * For spinlocks, etc:
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index 4abf110e2243..2dbb6c9c7e98 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -84,6 +84,18 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
>  	PVOP_VCALL1(mmu.exit_mmap, mm);
>  }
>  
> +#ifdef CONFIG_PARAVIRT_XL
> +static inline void arch_safe_halt(void)
> +{
> +	PVOP_VCALL0(irq.safe_halt);
> +}
> +
> +static inline void halt(void)
> +{
> +	PVOP_VCALL0(irq.halt);
> +}
> +#endif
> +
>  #ifdef CONFIG_PARAVIRT_XXL
>  static inline void load_sp0(unsigned long sp0)
>  {
> @@ -145,16 +157,6 @@ static inline void __write_cr4(unsigned long x)
>  	PVOP_VCALL1(cpu.write_cr4, x);
>  }
>  
> -static inline void arch_safe_halt(void)
> -{
> -	PVOP_VCALL0(irq.safe_halt);
> -}
> -
> -static inline void halt(void)
> -{
> -	PVOP_VCALL0(irq.halt);
> -}
> -
>  static inline void wbinvd(void)
>  {
>  	PVOP_VCALL0(cpu.wbinvd);
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index de87087d3bde..5261fba47ba5 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -177,7 +177,8 @@ struct pv_irq_ops {
>  	struct paravirt_callee_save save_fl;
>  	struct paravirt_callee_save irq_disable;
>  	struct paravirt_callee_save irq_enable;
> -
> +#endif
> +#ifdef CONFIG_PARAVIRT_XL
>  	void (*safe_halt)(void);
>  	void (*halt)(void);
>  #endif
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index c60222ab8ab9..d6d0b363fe70 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -322,9 +322,11 @@ struct paravirt_patch_template pv_ops = {
>  	.irq.save_fl		= __PV_IS_CALLEE_SAVE(native_save_fl),
>  	.irq.irq_disable	= __PV_IS_CALLEE_SAVE(native_irq_disable),
>  	.irq.irq_enable		= __PV_IS_CALLEE_SAVE(native_irq_enable),
> +#endif /* CONFIG_PARAVIRT_XXL */
> +#ifdef CONFIG_PARAVIRT_XL
>  	.irq.safe_halt		= native_safe_halt,
>  	.irq.halt		= native_halt,
> -#endif /* CONFIG_PARAVIRT_XXL */
> +#endif /* CONFIG_PARAVIRT_XL */
>  
>  	/* Mmu ops. */
>  	.mmu.flush_tlb_user	= native_flush_tlb_local,
> diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
> index 6c5eb6f3f14f..20d0cb116557 100644
> --- a/arch/x86/mm/mem_encrypt_identity.c
> +++ b/arch/x86/mm/mem_encrypt_identity.c
> @@ -24,6 +24,7 @@
>   * be extended when new paravirt and debugging variants are added.)
>   */
>  #undef CONFIG_PARAVIRT
> +#undef CONFIG_PARAVIRT_XL
>  #undef CONFIG_PARAVIRT_XXL
>  #undef CONFIG_PARAVIRT_SPINLOCKS
>  
> -- 
> 2.25.1
> 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-27 14:29           ` Dave Hansen
@ 2021-04-27 19:18             ` Kuppuswamy, Sathyanarayanan
  2021-04-27 19:20               ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-27 19:18 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

Hi Dave,

On 4/27/21 7:29 AM, Dave Hansen wrote:
>> Do we need to rename the helper functions ?
>>
>> tdvmcall(), tdvmcall_out_r11()
> Yes.
> 
>> Also what about output structs?
>>
>> struct tdcall_output
>> struct tdvmcall_output
> Yes, they need sane, straightforward names which are not confusing too.
> 

Following is the rename diff. Please let me know if you agree with the
names used.

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 6c3c71bb57a0..95a6a6c6061a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h

-struct tdcall_output {
+struct tdx_module_output {
         u64 rcx;
         u64 rdx;
         u64 r8;
@@ -19,7 +19,7 @@ struct tdcall_output {
         u64 r11;
  };

-struct tdvmcall_output {
+struct tdx_hypercall_output {
         u64 r11;
         u64 r12;
         u64 r13;
@@ -33,12 +33,12 @@ bool is_tdx_guest(void);
  void __init tdx_early_init(void);

  /* Helper function used to communicate with the TDX module */
-u64 __tdcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-            struct tdcall_output *out);
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+                     struct tdx_module_output *out);

  /* Helper function used to request services from VMM */
-u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
-              struct tdvmcall_output *out);
+u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+                   struct tdx_hypercall_output *out);

--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -8,11 +8,11 @@
  /*
   * Wrapper for use case that checks for error code and print warning message.
   */
-static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+static inline u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
  {
         u64 err;

-       err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
+       err = __tdx_hypercall(fn, r12, r13, r14, r15, NULL);

         if (err)
                 pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
@@ -24,13 +24,14 @@ static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
  /*
   * Wrapper for the semi-common case where we need single output value (R11).
   */
-static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+static inline u64 tdx_hypercall_out_r11(u64 fn, u64 r12, u64 r13,
+                                       u64 r14, u64 r15)


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-27 19:18             ` Kuppuswamy, Sathyanarayanan
@ 2021-04-27 19:20               ` Dave Hansen
  2021-04-28 17:42                 ` [PATCH v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() " Kuppuswamy Sathyanarayanan
  2021-05-19  5:58                 ` [RFC v2-fix-v1 " Kuppuswamy Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-04-27 19:20 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/27/21 12:18 PM, Kuppuswamy, Sathyanarayanan wrote:
> Following is the rename diff. Please let me know if you agree with the
> names used.

Look fine at a glance, but the real key is how they look when they get used.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* [PATCH v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-04-27 19:20               ` Dave Hansen
@ 2021-04-28 17:42                 ` Kuppuswamy Sathyanarayanan
  2021-05-19  5:58                 ` [RFC v2-fix-v1 " Kuppuswamy Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-28 17:42 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Guests communicate with VMMs via hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
expose guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called "tdcall".

In a TDX-based VM, since the VMM is an untrusted entity, an
intermediary layer (the TDX module) exists between host and guest to
facilitate secure communication. The "tdcall" instruction is used by
the guest to request services from the TDX module, and a variant of
the "tdcall" instruction (with specific arguments, as defined by the
GHCI) is used by the guest to request services from the VMM via the
TDX module.

Implement common helper functions to communicate with the TDX module
and the VMM (using the TDCALL instruction).

__tdx_hypercall()    - used to request services from the VMM.
__tdx_module_call()  - used to communicate with the TDX module.

Also define two additional wrappers, tdx_hypercall() and
tdx_hypercall_out_r11(), to cover common use cases of the
__tdx_hypercall() function. Since each use case of
__tdx_module_call() is different, no such wrappers are needed for it.

Implement __tdx_module_call() and __tdx_hypercall() helper functions
in assembly.

The rationale for implementing them in assembly rather than inline
assembly:

1. The __tdx_hypercall() implementation is over 70 lines of
instructions (with comments); expressing it in inline assembly would
make it hard to read.

2. Many registers (R8-R15, R[A-D]X) are used in the TDCALL operation.
If all of these registers were listed as inline assembly constraints,
some older compilers might not be able to satisfy the requirement.

Also, just like syscalls, not all TDVMCALL/TDCALL use cases need the
same set of argument registers. The implementation here picks the
current worst case for TDCALL (4 registers). TDCALLs with fewer than
4 arguments end up executing a few superfluous (cheap) instructions,
but this approach maximizes code reuse. The same argument applies to
the __tdx_hypercall() function as well.

The current implementation of __tdx_hypercall() does its error
handling (ud2 on failure) in the assembly function instead of in a C
wrapper. The reason for this choice is that, when adding support for
in/out instructions (refer to the patch titled "x86/tdx: Handle port
I/O" in this series), alternative_io() is used to substitute the
in/out instructions with __tdx_hypercall() calls. Using a C wrapper
is not trivial in that case because the input parameters would be in
the wrong registers, and it is tricky to include the proper glue code
to fix that up.

For the registers used by the TDCALL instruction, please check the
TDX GHCI specification, sections 2.4 and 3.

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Originally-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Hi Dave,

It includes all fixes suggested by you. Please let me know your
comments.

Changes since v1:
 * Renamed __tdcall()/__tdvmcall() to
   __tdx_module_call()/__tdx_hypercall().
 * Used BIT() to derive TDVMCALL_EXPOSE_REGS_MASK.
 * Removed unnecessary code in __tdcall() function.
 * Fixed comments as per Dave's review.

 arch/x86/include/asm/tdx.h    |  26 ++++
 arch/x86/kernel/Makefile      |   2 +-
 arch/x86/kernel/asm-offsets.c |  22 ++++
 arch/x86/kernel/tdcall.S      | 215 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c         |  39 ++++++
 5 files changed, 303 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..95a6a6c6061a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,38 @@
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
+#include <linux/types.h>
+
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+struct tdx_hypercall_output {
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
 void __init tdx_early_init(void);
 
+/* Helper function used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		      struct tdx_module_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+		    struct tdx_hypercall_output *out);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..e6b3bb983992 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
 #include <xen/interface/xen.h>
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
 #ifdef CONFIG_X86_32
 # include "asm-offsets_32.c"
 #else
@@ -75,6 +79,24 @@ static void __used common(void)
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+	BLANK();
+	/* Offsets for fields in struct tdx_module_output */
+	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
+	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
+	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+	/* Offsets for fields in struct tdx_hypercall_output */
+	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
+	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
+	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
+	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
+	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#endif
+
 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
 	OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..7e14b4a2312e
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,215 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+
+#define TDG_R10		BIT(10)
+#define TDG_R11		BIT(11)
+#define TDG_R12		BIT(12)
+#define TDG_R13		BIT(13)
+#define TDG_R14		BIT(14)
+#define TDG_R15		BIT(15)
+
+/*
+ * Bitmask of registers (R10-R15) to expose to the VMM. It is passed
+ * to the TDX module in RCX, which uses it to identify the registers
+ * that are shared with the VMM. Each bit in this mask represents a
+ * register ID. You can find the bit field details in the TDX GHCI
+ * specification.
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK	( TDG_R10 | TDG_R11 | \
+					  TDG_R12 | TDG_R13 | \
+					  TDG_R14 | TDG_R15 )
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdx_module_call()  - Helper function used by TDX guests to request
+ * services from the TDX module (does not include VMM services).
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by the "tdcall" ABI and to issue the
+ * "tdcall". If the "tdcall" operation is successful and a valid
+ * "struct tdx_module_output" pointer is available (in the "out"
+ * argument), the output from the TDX module is saved to the memory
+ * specified by the "out" pointer. The status of the "tdcall" operation
+ * is returned as the function's return value.
+ *
+ * @fn  (RDI)		- TDCALL Leaf ID,    moved to RAX
+ * @rcx (RSI)		- Input parameter 1, moved to RCX
+ * @rdx (RDX)		- Input parameter 2, moved to RDX
+ * @r8  (RCX)		- Input parameter 3, moved to R8
+ * @r9  (R8)		- Input parameter 4, moved to R9
+ *
+ * @out (R9)		- struct tdx_module_output pointer
+ *			  stored temporarily in R12 (not
+ * 			  shared with the TDX module)
+ *
+ * Return status of tdcall via RAX.
+ *
+ * NOTE: This function should not be used for TDX hypercall
+ *       use cases.
+ */
+SYM_FUNC_START(__tdx_module_call)
+	FRAME_BEGIN
+
+	/*
+	 * R12 will be used as temporary storage for
+	 * struct tdx_module_output pointer. You can
+	 * find struct tdx_module_output details in
+	 * arch/x86/include/asm/tdx.h. Also note that
+	 * registers R12-R15 are not used by TDCALL
+	 * services supported by this helper function.
+	 */
+	push %r12	/* Callee saved, so preserve it */
+	mov %r9,  %r12 	/* Move output pointer to R12 */
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	mov %rdi, %rax	/* Move TDCALL Leaf ID to RAX */
+	mov %r8,  %r9	/* Move input 4 to R9 */
+	mov %rcx, %r8	/* Move input 3 to R8 */
+	mov %rsi, %rcx	/* Move input 1 to RCX */
+	/* Leave input param 2 in RDX */
+
+	tdcall
+
+	/* Check for TDCALL success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check for TDCALL output struct != NULL */
+	test %r12, %r12
+	jz 1f
+
+	/* Copy TDCALL result registers to output struct: */
+	movq %rcx, TDX_MODULE_rcx(%r12)
+	movq %rdx, TDX_MODULE_rdx(%r12)
+	movq %r8,  TDX_MODULE_r8(%r12)
+	movq %r9,  TDX_MODULE_r9(%r12)
+	movq %r10, TDX_MODULE_r10(%r12)
+	movq %r11, TDX_MODULE_r11(%r12)
+1:
+	pop %r12 /* Restore the state of R12 register */
+
+	FRAME_END
+	ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * do_tdx_hypercall()  - Helper function used by TDX guests to request
+ * services from the VMM. All requests are made via the TDX module
+ * using "TDCALL" instruction.
+ *
+ * This function contains the code common to vendor-specific and
+ * standard TDX hypercalls. The caller of this function has to set the
+ * TDVMCALL type in the R10 register before calling it.
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by the "tdcall" ABI and shares them
+ * with the VMM via the TDX module. If the "tdcall" operation is
+ * successful and a valid "struct tdx_hypercall_output" pointer is
+ * available (in the "out" argument), the output from the VMM is saved
+ * to the memory specified by the "out" pointer.
+ *
+ * @fn  (RDI)		- TDVMCALL function, moved to R11
+ * @r12 (RSI)		- Input parameter 1, moved to R12
+ * @r13 (RDX)		- Input parameter 2, moved to R13
+ * @r14 (RCX)		- Input parameter 3, moved to R14
+ * @r15 (R8)		- Input parameter 4, moved to R15
+ *
+ * @out (R9)		- struct tdx_hypercall_output pointer
+ *
+ * On successful completion, return TDX hypercall error code.
+ * If the "tdcall" operation fails, panic.
+ *
+ */
+SYM_CODE_START_LOCAL(do_tdx_hypercall)
+	FRAME_BEGIN
+
+	/* Save non-volatile GPRs that are exposed to the VMM. */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/* Leave hypercall output pointer in R9, it's not clobbered by VMM */
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	xor %eax, %eax /* Move TDCALL leaf ID (TDVMCALL (0)) to RAX */
+	mov %rdi, %r11 /* Move TDVMCALL function id to R11 */
+	mov %rsi, %r12 /* Move input 1 to R12 */
+	mov %rdx, %r13 /* Move input 2 to R13 */
+	mov %rcx, %r14 /* Move input 3 to R14 */
+	mov %r8,  %r15 /* Move input 4 to R15 */
+	/* Caller of do_tdx_hypercall() will set TDVMCALL type in R10 */
+
+	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+	tdcall
+
+	/*
+	 * Check for TDCALL success: 0 - Successful, otherwise failed.
+	 * If failed, there is an issue with TDX Module which is fatal
+	 * for the guest. So panic. Also note that RAX is controlled
+	 * only by the TDX module and not exposed to VMM.
+	 */
+	test %rax, %rax
+	jnz 2f
+
+	/* Move hypercall error code to RAX to return to user */
+	mov %r10, %rax
+
+	/* Check for hypercall success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check for hypercall output struct != NULL */
+	test %r9, %r9
+	jz 1f
+
+	/* Copy hypercall result registers to output struct: */
+	movq %r11, TDX_HYPERCALL_r11(%r9)
+	movq %r12, TDX_HYPERCALL_r12(%r9)
+	movq %r13, TDX_HYPERCALL_r13(%r9)
+	movq %r14, TDX_HYPERCALL_r14(%r9)
+	movq %r15, TDX_HYPERCALL_r15(%r9)
+1:
+	/*
+	 * Zero out registers exposed to the VMM to avoid
+	 * speculative execution with VMM-controlled values.
+	 */
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+	xor %r12d, %r12d
+	xor %r13d, %r13d
+	xor %r14d, %r14d
+	xor %r15d, %r15d
+
+	/* Restore non-volatile GPRs that are exposed to the VMM. */
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	FRAME_END
+	ret
+2:
+	ud2
+SYM_CODE_END(do_tdx_hypercall)
+
+/* Helper function for standard type of TDVMCALL */
+SYM_FUNC_START(__tdx_hypercall)
+	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+	xor %r10, %r10
+	call do_tdx_hypercall
+	retq
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 6a7193fead08..cbfefc42641e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -1,8 +1,47 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (C) 2020 Intel Corporation */
 
+#define pr_fmt(fmt) "TDX: " fmt
+
 #include <asm/tdx.h>
 
+/*
+ * Wrapper for simple hypercalls that only return a success/error code.
+ */
+static inline u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	u64 err;
+
+	err = __tdx_hypercall(fn, r12, r13, r14, r15, NULL);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return err;
+}
+
+/*
+ * Wrapper for the semi-common case where we need a single output
+ * value (R11). Callers of this function do not care about the
+ * hypercall error code (mainly for the IN or MMIO use cases).
+ */
+static inline u64 tdx_hypercall_out_r11(u64 fn, u64 r12, u64 r13,
+					u64 r14, u64 r15)
+{
+
+	struct tdx_hypercall_output out = {0};
+	u64 err;
+
+	err = __tdx_hypercall(fn, r12, r13, r14, r15, &out);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return out.r11;
+}
+
 static inline bool cpuid_has_tdx_guest(void)
 {
 	u32 eax, signature[3];
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2 00/32] Add TDX Guest Support
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (31 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 32/32] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kuppuswamy Sathyanarayanan
@ 2021-05-03 23:21 ` Kuppuswamy, Sathyanarayanan
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-03 23:21 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

Hi Peter/Andy,

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> Hi All,

Just a gentle ping. Please let me know your comments on this patch set.
I hope it addresses the concerns you raised in RFC v1.

> 
> NOTE: This series is not ready for wide public review. It is being
> specifically posted so that Peter Z and other experts on the entry
> code can look for problems with the new exception handler (#VE).
> That's also why x86@ is not being spammed.
> 
> Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
> hosts and some physical attacks. This series adds the bare-minimum
> support to run a TDX guest. The host-side support will be submitted
> separately. Also support for advanced TD guest features like attestation
> or debug-mode will be submitted separately. Also, at this point it is not
> secure with some known holes in drivers, and also hasn’t been fully audited
> and fuzzed yet.
> 
> TDX has a lot of similarities to SEV. It enhances confidentiality
> of guest memory and state (like registers) and includes a new exception
> (#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
> yet), TDX limits the host's ability to effect changes in the guest
> physical address space.
> 
> In contrast to the SEV code in the kernel, TDX guest memory is integrity
> protected and isolated; the host is prevented from accessing guest
> memory (even ciphertext).
> 
> The TDX architecture also includes a new CPU mode called
> Secure-Arbitration Mode (SEAM). The software (TDX module) running in this
> mode arbitrates interactions between host and guest and implements many of
> the guarantees of the TDX architecture.
> 
> Some of the key differences between TD and regular VM is,
> 
> 1. Multi CPU bring-up is done using the ACPI MADT wake-up table.
> 2. A new #VE exception handler is added. The TDX module injects #VE exception
>     to the guest TD in cases of instructions that need to be emulated, disallowed
>     MSR accesses, subset of CPUID leaves, etc.
> 3. By default memory is marked as private, and TD will selectively share it with
>     VMM based on need.
> 4. Remote attestation is supported to enable a third party (either the owner of
>     the workload or a user of the services provided by the workload) to establish
>     that the workload is running on an Intel-TDX-enabled platform located within a
>     TD prior to providing that workload data.
> 
> You can find TDX related documents in the following link.
> 
> https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html
> 
> Changes since v1:
>   * Implemented tdcall() and tdvmcall() helper functions in assembly and renamed
>     them as __tdcall() and __tdvmcall().
>   * Added do_general_protection() helper function to re-use protection
>     code between #GP exception and TDX #VE exception handlers.
>   * Addressed syscall gap issue in #VE handler support (for details check
>     the commit log in "x86/traps: Add #VE support for TDX guest").
>   * Modified patch titled "x86/tdx: Handle port I/O" to re-use common
>     tdvmcall() helper function.
>   * Added error handling support to MADT CPU wakeup code.
>   * Introduced enum tdx_map_type to identify SHARED vs PRIVATE memory type.
>   * Enabled shared memory in IOAPIC driver.
>   * Added BINUTILS version info for TDCALL.
>   * Changed the TDVMCALL vendor id from 0 to "TDX.KVM".
>   * Replaced WARN() with pr_warn_ratelimited() in __tdvmcall() wrappers.
>   * Fixed commit log and code comments related review comments.
>   * Renamed patch titled # "x86/topology: Disable CPU hotplug support for TDX
>     platforms" to "x86/topology: Disable CPU online/offline control for
>     TDX guest"
>   * Rebased on top of v5.12 kernel.
> 
> 
> Erik Kaneda (1):
>    ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure
> 
> Isaku Yamahata (1):
>    x86/tdx: ioapic: Add shared bit for IOAPIC base address
> 
> Kirill A. Shutemov (16):
>    x86/paravirt: Introduce CONFIG_PARAVIRT_XL
>    x86/tdx: Get TD execution environment information via TDINFO
>    x86/traps: Add #VE support for TDX guest
>    x86/tdx: Add HLT support for TDX guest
>    x86/tdx: Wire up KVM hypercalls
>    x86/tdx: Add MSR support for TDX guest
>    x86/tdx: Handle CPUID via #VE
>    x86/io: Allow to override inX() and outX() implementation
>    x86/tdx: Handle port I/O
>    x86/tdx: Handle in-kernel MMIO
>    x86/mm: Move force_dma_unencrypted() to common code
>    x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
>    x86/tdx: Make pages shared in ioremap()
>    x86/tdx: Add helper to do MapGPA TDVMALL
>    x86/tdx: Make DMA pages shared
>    x86/kvm: Use bounce buffers for TD guest
> 
> Kuppuswamy Sathyanarayanan (10):
>    x86/tdx: Introduce INTEL_TDX_GUEST config option
>    x86/cpufeatures: Add TDX Guest CPU feature
>    x86/x86: Add is_tdx_guest() interface
>    x86/tdx: Add __tdcall() and __tdvmcall() helper functions
>    x86/traps: Add do_general_protection() helper function
>    x86/tdx: Handle MWAIT, MONITOR and WBINVD
>    ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure
>    ACPI/table: Print MADT Wake table information
>    x86/acpi, x86/boot: Add multiprocessor wake-up support
>    x86/topology: Disable CPU online/offline control for TDX guest
> 
> Sean Christopherson (4):
>    x86/boot: Add a trampoline for APs booting in 64-bit mode
>    x86/boot: Avoid #VE during compressed boot for TDX platforms
>    x86/boot: Avoid unnecessary #VE during boot process
>    x86/tdx: Forcefully disable legacy PIC for TDX guests
> 
>   arch/x86/Kconfig                         |  28 +-
>   arch/x86/boot/compressed/Makefile        |   2 +
>   arch/x86/boot/compressed/head_64.S       |  10 +-
>   arch/x86/boot/compressed/misc.h          |   1 +
>   arch/x86/boot/compressed/pgtable.h       |   2 +-
>   arch/x86/boot/compressed/tdcall.S        |   9 +
>   arch/x86/boot/compressed/tdx.c           |  32 ++
>   arch/x86/include/asm/apic.h              |   3 +
>   arch/x86/include/asm/cpufeatures.h       |   1 +
>   arch/x86/include/asm/idtentry.h          |   4 +
>   arch/x86/include/asm/io.h                |  24 +-
>   arch/x86/include/asm/irqflags.h          |  38 +-
>   arch/x86/include/asm/kvm_para.h          |  21 +
>   arch/x86/include/asm/paravirt.h          |  22 +-
>   arch/x86/include/asm/paravirt_types.h    |   3 +-
>   arch/x86/include/asm/pgtable.h           |   3 +
>   arch/x86/include/asm/realmode.h          |   1 +
>   arch/x86/include/asm/tdx.h               | 176 +++++++++
>   arch/x86/kernel/Makefile                 |   1 +
>   arch/x86/kernel/acpi/boot.c              |  79 ++++
>   arch/x86/kernel/apic/apic.c              |   8 +
>   arch/x86/kernel/apic/io_apic.c           |  12 +-
>   arch/x86/kernel/asm-offsets.c            |  22 ++
>   arch/x86/kernel/head64.c                 |   3 +
>   arch/x86/kernel/head_64.S                |  13 +-
>   arch/x86/kernel/idt.c                    |   6 +
>   arch/x86/kernel/paravirt.c               |   4 +-
>   arch/x86/kernel/pci-swiotlb.c            |   2 +-
>   arch/x86/kernel/smpboot.c                |   5 +
>   arch/x86/kernel/tdcall.S                 | 361 +++++++++++++++++
>   arch/x86/kernel/tdx-kvm.c                |  45 +++
>   arch/x86/kernel/tdx.c                    | 480 +++++++++++++++++++++++
>   arch/x86/kernel/topology.c               |   3 +-
>   arch/x86/kernel/traps.c                  |  81 ++--
>   arch/x86/mm/Makefile                     |   2 +
>   arch/x86/mm/ioremap.c                    |   8 +-
>   arch/x86/mm/mem_encrypt.c                |  75 ----
>   arch/x86/mm/mem_encrypt_common.c         |  85 ++++
>   arch/x86/mm/mem_encrypt_identity.c       |   1 +
>   arch/x86/mm/pat/set_memory.c             |  48 ++-
>   arch/x86/realmode/rm/header.S            |   1 +
>   arch/x86/realmode/rm/trampoline_64.S     |  49 ++-
>   arch/x86/realmode/rm/trampoline_common.S |   5 +-
>   drivers/acpi/tables.c                    |  11 +
>   include/acpi/actbl2.h                    |  26 +-
>   45 files changed, 1654 insertions(+), 162 deletions(-)
>   create mode 100644 arch/x86/boot/compressed/tdcall.S
>   create mode 100644 arch/x86/boot/compressed/tdx.c
>   create mode 100644 arch/x86/include/asm/tdx.h
>   create mode 100644 arch/x86/kernel/tdcall.S
>   create mode 100644 arch/x86/kernel/tdx-kvm.c
>   create mode 100644 arch/x86/kernel/tdx.c
>   create mode 100644 arch/x86/mm/mem_encrypt_common.c
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-04-27 17:31   ` Borislav Petkov
@ 2021-05-06 14:59     ` Kirill A. Shutemov
  2021-05-10  8:07     ` Juergen Gross
  1 sibling, 0 replies; 381+ messages in thread
From: Kirill A. Shutemov @ 2021-05-06 14:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Dan Williams, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel, Jürgen Gross

[-- Attachment #1: Type: text/plain, Size: 276 bytes --]

On Tue, Apr 27, 2021 at 07:31:09PM +0200, Borislav Petkov wrote:
> Or something to that effect.

See the couple of attached patches. Does look along the lines you wanted?

The first one renames PARAVIRT_XXL and the second one introduces
PARAVIRT_HLT.

-- 
 Kirill A. Shutemov

[-- Attachment #2: 0001-x86-paravirt-Rename-PARAVIRT_XXL-to-PARAVIRT_FULL.patch --]
[-- Type: text/x-diff, Size: 20397 bytes --]

From 2671a044687da0e0c7105beb3467a270b8863a1b Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Thu, 6 May 2021 17:04:42 +0300
Subject: [PATCH 1/2] x86/paravirt: Rename PARAVIRT_XXL to PARAVIRT_FULL

PARAVIRT_XXL provides a way to hook up a full set of paravirt ops.
Rename it to PARAVIRT_FULL to be more self-descriptive.

It's a preparation for the next patch.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                               |  2 +-
 arch/x86/boot/compressed/misc.h                |  2 +-
 arch/x86/entry/vdso/vdso32/vclock_gettime.c    |  2 +-
 arch/x86/include/asm/debugreg.h                |  2 +-
 arch/x86/include/asm/desc.h                    |  4 ++--
 arch/x86/include/asm/fixmap.h                  |  4 ++--
 arch/x86/include/asm/io_bitmap.h               |  2 +-
 arch/x86/include/asm/irqflags.h                |  4 ++--
 arch/x86/include/asm/mmu_context.h             |  4 ++--
 arch/x86/include/asm/msr.h                     |  4 ++--
 arch/x86/include/asm/paravirt.h                | 12 ++++++------
 arch/x86/include/asm/paravirt_types.h          | 10 +++++-----
 arch/x86/include/asm/pgalloc.h                 |  2 +-
 arch/x86/include/asm/pgtable.h                 |  6 +++---
 arch/x86/include/asm/processor.h               |  4 ++--
 arch/x86/include/asm/ptrace.h                  |  2 +-
 arch/x86/include/asm/required-features.h       |  2 +-
 arch/x86/include/asm/special_insns.h           |  4 ++--
 arch/x86/kernel/asm-offsets.c                  |  2 +-
 arch/x86/kernel/asm-offsets_64.c               |  2 +-
 arch/x86/kernel/paravirt.c                     | 18 +++++++++---------
 arch/x86/kernel/paravirt_patch.c               |  6 +++---
 arch/x86/mm/mem_encrypt_identity.c             |  2 +-
 arch/x86/xen/Kconfig                           |  2 +-
 tools/arch/x86/include/asm/required-features.h |  2 +-
 25 files changed, 53 insertions(+), 53 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..568b96e20d59 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -783,7 +783,7 @@ config PARAVIRT
 	  over full virtualization.  However, when run without a hypervisor
 	  the kernel is theoretically slower and slightly larger.
 
-config PARAVIRT_XXL
+config PARAVIRT_FULL
 	bool
 
 config PARAVIRT_DEBUG
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 901ea5ebec22..0e5713c1cb86 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -9,7 +9,7 @@
  * paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
-#undef CONFIG_PARAVIRT_XXL
+#undef CONFIG_PARAVIRT_FULL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 #undef CONFIG_KASAN
 #undef CONFIG_KASAN_GENERIC
diff --git a/arch/x86/entry/vdso/vdso32/vclock_gettime.c b/arch/x86/entry/vdso/vdso32/vclock_gettime.c
index 283ed9d00426..6f543b40b1f4 100644
--- a/arch/x86/entry/vdso/vdso32/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vdso32/vclock_gettime.c
@@ -14,7 +14,7 @@
 #undef CONFIG_ILLEGAL_POINTER_VALUE
 #undef CONFIG_SPARSEMEM_VMEMMAP
 #undef CONFIG_NR_CPUS
-#undef CONFIG_PARAVIRT_XXL
+#undef CONFIG_PARAVIRT_FULL
 
 #define CONFIG_X86_32 1
 #define CONFIG_PGTABLE_LEVELS 2
diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index cfdf307ddc01..c4c9b9cbda55 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -8,7 +8,7 @@
 
 DECLARE_PER_CPU(unsigned long, cpu_dr7);
 
-#ifndef CONFIG_PARAVIRT_XXL
+#ifndef CONFIG_PARAVIRT_FULL
 /*
  * These special macros can be used to get or set a debugging register
  */
diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 476082a83d1c..51b77118307b 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -103,7 +103,7 @@ static inline int desc_empty(const void *ptr)
 	return !(desc[0] | desc[1]);
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 #define load_TR_desc()				native_load_tr_desc()
@@ -129,7 +129,7 @@ static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries)
 static inline void paravirt_free_ldt(struct desc_struct *ldt, unsigned entries)
 {
 }
-#endif	/* CONFIG_PARAVIRT_XXL */
+#endif	/* CONFIG_PARAVIRT_FULL */
 
 #define store_ldt(ldt) asm("sldt %0" : "=m"(ldt))
 
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index d0dcefb5cc59..a0a4db7b255e 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -105,7 +105,7 @@ enum fixed_addresses {
 	FIX_PCIE_MCFG,
 #endif
 #endif
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	FIX_PARAVIRT_BOOTMAP,
 #endif
 
@@ -160,7 +160,7 @@ void __native_set_fixmap(enum fixed_addresses idx, pte_t pte);
 void native_set_fixmap(unsigned /* enum fixed_addresses */ idx,
 		       phys_addr_t phys, pgprot_t flags);
 
-#ifndef CONFIG_PARAVIRT_XXL
+#ifndef CONFIG_PARAVIRT_FULL
 static inline void __set_fixmap(enum fixed_addresses idx,
 				phys_addr_t phys, pgprot_t flags)
 {
diff --git a/arch/x86/include/asm/io_bitmap.h b/arch/x86/include/asm/io_bitmap.h
index 7f080f5c7def..2c20cd0669d3 100644
--- a/arch/x86/include/asm/io_bitmap.h
+++ b/arch/x86/include/asm/io_bitmap.h
@@ -36,7 +36,7 @@ static inline void native_tss_invalidate_io_bitmap(void)
 
 void native_tss_update_io_bitmap(void);
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 #define tss_update_io_bitmap native_tss_update_io_bitmap
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 144d70ea4393..a4d7dbc2b034 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -59,7 +59,7 @@ static inline __cpuidle void native_halt(void)
 
 #endif
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 #ifndef __ASSEMBLY__
@@ -124,7 +124,7 @@ static __always_inline unsigned long arch_local_irq_save(void)
 #endif
 
 #endif /* __ASSEMBLY__ */
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_FULL */
 
 #ifndef __ASSEMBLY__
 static __always_inline int arch_irqs_disabled_flags(unsigned long flags)
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 27516046117a..98949a97daf3 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -15,12 +15,12 @@
 
 extern atomic64_t last_mm_ctx_id;
 
-#ifndef CONFIG_PARAVIRT_XXL
+#ifndef CONFIG_PARAVIRT_FULL
 static inline void paravirt_activate_mm(struct mm_struct *prev,
 					struct mm_struct *next)
 {
 }
-#endif	/* !CONFIG_PARAVIRT_XXL */
+#endif	/* !CONFIG_PARAVIRT_FULL */
 
 #ifdef CONFIG_PERF_EVENTS
 DECLARE_STATIC_KEY_FALSE(rdpmc_never_available_key);
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index e16cccdd0420..7d1c97093780 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -251,7 +251,7 @@ static inline unsigned long long native_read_pmc(int counter)
 	return EAX_EDX_VAL(val, low, high);
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 #include <linux/errno.h>
@@ -314,7 +314,7 @@ do {							\
 
 #define rdpmcl(counter, val) ((val) = native_read_pmc(counter))
 
-#endif	/* !CONFIG_PARAVIRT_XXL */
+#endif	/* !CONFIG_PARAVIRT_FULL */
 
 /*
  * 64-bit version of wrmsr_safe():
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110e2243..02751519b0d9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,7 +84,7 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 	PVOP_VCALL1(mmu.exit_mmap, mm);
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 static inline void load_sp0(unsigned long sp0)
 {
 	PVOP_VCALL1(cpu.load_sp0, sp0);
@@ -642,7 +642,7 @@ bool __raw_callee_save___native_vcpu_is_preempted(long cpu);
 #define __PV_IS_CALLEE_SAVE(func)			\
 	((struct paravirt_callee_save) { func })
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 static inline notrace unsigned long arch_local_save_flags(void)
 {
 	return PVOP_CALLEE0(unsigned long, irq.save_fl);
@@ -748,7 +748,7 @@ extern void default_banner(void);
 #define PARA_INDIRECT(addr)	*%cs:addr
 #endif
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #define INTERRUPT_RETURN						\
 	PARA_SITE(PARA_PATCH(PV_CPU_iret),				\
 		  ANNOTATE_RETPOLINE_SAFE;				\
@@ -770,7 +770,7 @@ extern void default_banner(void);
 #endif
 
 #ifdef CONFIG_X86_64
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #ifdef CONFIG_DEBUG_ENTRY
 #define SAVE_FLAGS(clobbers)                                        \
 	PARA_SITE(PARA_PATCH(PV_IRQ_save_fl),			    \
@@ -779,7 +779,7 @@ extern void default_banner(void);
 		  call PARA_INDIRECT(pv_ops+PV_IRQ_save_fl);	    \
 		  PV_RESTORE_REGS(clobbers | CLBR_CALLEE_SAVE);)
 #endif
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_FULL */
 #endif	/* CONFIG_X86_64 */
 
 #endif /* __ASSEMBLY__ */
@@ -788,7 +788,7 @@ extern void default_banner(void);
 #endif /* !CONFIG_PARAVIRT */
 
 #ifndef __ASSEMBLY__
-#ifndef CONFIG_PARAVIRT_XXL
+#ifndef CONFIG_PARAVIRT_FULL
 static inline void paravirt_arch_dup_mmap(struct mm_struct *oldmm,
 					  struct mm_struct *mm)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index de87087d3bde..ae3503b2e8a2 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -66,7 +66,7 @@ struct paravirt_callee_save {
 
 /* general info */
 struct pv_info {
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	u16 extra_user_64bit_cs;  /* __USER_CS if none */
 #endif
 
@@ -86,7 +86,7 @@ struct pv_init_ops {
 			  unsigned long addr, unsigned len);
 } __no_randomize_layout;
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 struct pv_lazy_ops {
 	/* Set deferred update mode, used for batching operations. */
 	void (*enter)(void);
@@ -104,7 +104,7 @@ struct pv_cpu_ops {
 	/* hooks for various privileged instructions */
 	void (*io_delay)(void);
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	unsigned long (*get_debugreg)(int regno);
 	void (*set_debugreg)(int regno, unsigned long value);
 
@@ -166,7 +166,7 @@ struct pv_cpu_ops {
 } __no_randomize_layout;
 
 struct pv_irq_ops {
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	/*
 	 * Get/set interrupt state.  save_fl is expected to use X86_EFLAGS_IF;
 	 * all other bits returned from save_fl are undefined.
@@ -196,7 +196,7 @@ struct pv_mmu_ops {
 	/* Hook for intercepting the destruction of an mm_struct. */
 	void (*exit_mmap)(struct mm_struct *mm);
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	struct paravirt_callee_save read_cr2;
 	void (*write_cr2)(unsigned long);
 
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index 62ad61d6fefc..7bd2744b52ba 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -12,7 +12,7 @@
 
 static inline int  __paravirt_pgd_alloc(struct mm_struct *mm) { return 0; }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 #define paravirt_pgd_alloc(mm)	__paravirt_pgd_alloc(mm)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..8c4eecc0444a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -59,9 +59,9 @@ extern struct mm_struct *pgd_page_get_mm(struct page *page);
 
 extern pmdval_t early_pmd_flags;
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
-#else  /* !CONFIG_PARAVIRT_XXL */
+#else  /* !CONFIG_PARAVIRT_FULL */
 #define set_pte(ptep, pte)		native_set_pte(ptep, pte)
 
 #define set_pte_atomic(ptep, pte)					\
@@ -115,7 +115,7 @@ extern pmdval_t early_pmd_flags;
 #define __pte(x)	native_make_pte(x)
 
 #define arch_end_context_switch(prev)	do {} while(0)
-#endif	/* CONFIG_PARAVIRT_XXL */
+#endif	/* CONFIG_PARAVIRT_FULL */
 
 /*
  * The following only work if pte_present() is true.
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f1b9ed5efaa9..47c4eb146a87 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -580,7 +580,7 @@ static inline bool on_thread_stack(void)
 			       current_stack_pointer) < THREAD_SIZE;
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 #define __cpuid			native_cpuid
@@ -590,7 +590,7 @@ static inline void load_sp0(unsigned long sp0)
 	native_load_sp0(sp0);
 }
 
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_FULL */
 
 /* Free all resources held by a thread. */
 extern void release_thread(struct task_struct *);
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 409f661481e1..0f8adc38fc03 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -146,7 +146,7 @@ static inline int v8086_mode(struct pt_regs *regs)
 static inline bool user_64bit_mode(struct pt_regs *regs)
 {
 #ifdef CONFIG_X86_64
-#ifndef CONFIG_PARAVIRT_XXL
+#ifndef CONFIG_PARAVIRT_FULL
 	/*
 	 * On non-paravirt systems, this is the only long mode CPL 3
 	 * selector.  We do not allow long mode selectors in the LDT.
diff --git a/arch/x86/include/asm/required-features.h b/arch/x86/include/asm/required-features.h
index b2d504f11937..e37ef3a4cbd3 100644
--- a/arch/x86/include/asm/required-features.h
+++ b/arch/x86/include/asm/required-features.h
@@ -54,7 +54,7 @@
 #endif
 
 #ifdef CONFIG_X86_64
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 /* Paravirtualized systems may not have PSE or PGE available */
 #define NEED_PSE	0
 #define NEED_PGE	0
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 1d3cbaef4bb7..f26fc9acf4cc 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -148,7 +148,7 @@ static inline unsigned long __read_cr4(void)
 	return native_read_cr4();
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 
@@ -205,7 +205,7 @@ static inline void load_gs_index(unsigned int selector)
 
 #endif
 
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_FULL */
 
 static inline void clflush(volatile void *__p)
 {
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..cc247c723c5e 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -61,7 +61,7 @@ static void __used common(void)
 	OFFSET(IA32_RT_SIGFRAME_sigcontext, rt_sigframe_ia32, uc.uc_mcontext);
 #endif
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	BLANK();
 	OFFSET(PV_IRQ_irq_disable, paravirt_patch_template, irq.irq_disable);
 	OFFSET(PV_IRQ_irq_enable, paravirt_patch_template, irq.irq_enable);
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index b14533af7676..7bc5cb486eca 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -12,7 +12,7 @@
 int main(void)
 {
 #ifdef CONFIG_PARAVIRT
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #ifdef CONFIG_DEBUG_ENTRY
 	OFFSET(PV_IRQ_save_fl, paravirt_patch_template, irq.save_fl);
 #endif
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c60222ab8ab9..e3a5f0cf9340 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -79,7 +79,7 @@ static unsigned paravirt_patch_call(void *insn_buff, const void *target,
 	return call_len;
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 /* identity function, which can be inlined */
 u64 notrace _paravirt_ident_64(u64 x)
 {
@@ -130,7 +130,7 @@ unsigned paravirt_patch_default(u8 type, void *insn_buff,
 	else if (opfunc == _paravirt_nop)
 		ret = 0;
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	/* identity functions just return their single argument */
 	else if (opfunc == _paravirt_ident_64)
 		ret = paravirt_patch_ident_64(insn_buff, len);
@@ -227,7 +227,7 @@ void paravirt_flush_lazy_mmu(void)
 	preempt_enable();
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 void paravirt_start_context_switch(struct task_struct *prev)
 {
 	BUG_ON(preemptible());
@@ -260,7 +260,7 @@ enum paravirt_lazy_mode paravirt_get_lazy_mode(void)
 
 struct pv_info pv_info = {
 	.name = "bare hardware",
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	.extra_user_64bit_cs = __USER_CS,
 #endif
 };
@@ -279,7 +279,7 @@ struct paravirt_patch_template pv_ops = {
 	/* Cpu ops. */
 	.cpu.io_delay		= native_io_delay,
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	.cpu.cpuid		= native_cpuid,
 	.cpu.get_debugreg	= native_get_debugreg,
 	.cpu.set_debugreg	= native_set_debugreg,
@@ -324,7 +324,7 @@ struct paravirt_patch_template pv_ops = {
 	.irq.irq_enable		= __PV_IS_CALLEE_SAVE(native_irq_enable),
 	.irq.safe_halt		= native_safe_halt,
 	.irq.halt		= native_halt,
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_FULL */
 
 	/* Mmu ops. */
 	.mmu.flush_tlb_user	= native_flush_tlb_local,
@@ -336,7 +336,7 @@ struct paravirt_patch_template pv_ops = {
 
 	.mmu.exit_mmap		= paravirt_nop,
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	.mmu.read_cr2		= __PV_IS_CALLEE_SAVE(native_read_cr2),
 	.mmu.write_cr2		= native_write_cr2,
 	.mmu.read_cr3		= __native_read_cr3,
@@ -393,7 +393,7 @@ struct paravirt_patch_template pv_ops = {
 	},
 
 	.mmu.set_fixmap		= native_set_fixmap,
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_FULL */
 
 #if defined(CONFIG_PARAVIRT_SPINLOCKS)
 	/* Lock ops. */
@@ -409,7 +409,7 @@ struct paravirt_patch_template pv_ops = {
 #endif
 };
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 /* At this point, native_get/set_debugreg has real function entries */
 NOKPROBE_SYMBOL(native_get_debugreg);
 NOKPROBE_SYMBOL(native_set_debugreg);
diff --git a/arch/x86/kernel/paravirt_patch.c b/arch/x86/kernel/paravirt_patch.c
index abd27ec67397..d100993dfdb3 100644
--- a/arch/x86/kernel/paravirt_patch.c
+++ b/arch/x86/kernel/paravirt_patch.c
@@ -17,7 +17,7 @@
 	case PARAVIRT_PATCH(ops.m):					\
 		return PATCH(data, ops##_##m, insn_buff, len)
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 struct patch_xxl {
 	const unsigned char	irq_irq_disable[1];
 	const unsigned char	irq_irq_enable[1];
@@ -44,7 +44,7 @@ unsigned int paravirt_patch_ident_64(void *insn_buff, unsigned int len)
 {
 	return PATCH(xxl, mov64, insn_buff, len);
 }
-# endif /* CONFIG_PARAVIRT_XXL */
+# endif /* CONFIG_PARAVIRT_FULL */
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 struct patch_lock {
@@ -68,7 +68,7 @@ unsigned int native_patch(u8 type, void *insn_buff, unsigned long addr,
 {
 	switch (type) {
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	PATCH_CASE(irq, save_fl, xxl, insn_buff, len);
 	PATCH_CASE(irq, irq_enable, xxl, insn_buff, len);
 	PATCH_CASE(irq, irq_disable, xxl, insn_buff, len);
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 6c5eb6f3f14f..53fe895cefe2 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -24,7 +24,7 @@
  * be extended when new paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
-#undef CONFIG_PARAVIRT_XXL
+#undef CONFIG_PARAVIRT_FULL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 
 #include <linux/kernel.h>
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index afc1da68b06d..aa96670248e7 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -20,7 +20,7 @@ config XEN_PV
 	default y
 	depends on XEN
 	depends on X86_64
-	select PARAVIRT_XXL
+	select PARAVIRT_FULL
 	select XEN_HAVE_PVMMU
 	select XEN_HAVE_VPMU
 	help
diff --git a/tools/arch/x86/include/asm/required-features.h b/tools/arch/x86/include/asm/required-features.h
index b2d504f11937..e37ef3a4cbd3 100644
--- a/tools/arch/x86/include/asm/required-features.h
+++ b/tools/arch/x86/include/asm/required-features.h
@@ -54,7 +54,7 @@
 #endif
 
 #ifdef CONFIG_X86_64
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 /* Paravirtualized systems may not have PSE or PGE available */
 #define NEED_PSE	0
 #define NEED_PGE	0
-- 
2.26.3


[-- Attachment #3: 0002-x86-paravirt-Introduce-CONFIG_PARAVIRT_HLT.patch --]
[-- Type: text/x-diff, Size: 5575 bytes --]

From 08e03cebb65b48e6b7e5bf0d805762cc661d09f0 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Tue, 19 Nov 2019 14:56:03 +0300
Subject: [PATCH 2/2] x86/paravirt: Introduce CONFIG_PARAVIRT_HLT

CONFIG_PARAVIRT_FULL provides a way to hook up the full set of paravirt
ops, but TDX only needs two of them: halt() and safe_halt().

Split off halt paravirt calls from CONFIG_PARAVIRT_FULL into
a separate config option -- CONFIG_PARAVIRT_HLT.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                      |  4 +++
 arch/x86/boot/compressed/misc.h       |  1 +
 arch/x86/include/asm/irqflags.h       | 38 +++++++++++++++------------
 arch/x86/include/asm/paravirt.h       | 22 +++++++++-------
 arch/x86/include/asm/paravirt_types.h |  3 ++-
 arch/x86/kernel/paravirt.c            |  4 ++-
 arch/x86/mm/mem_encrypt_identity.c    |  1 +
 7 files changed, 44 insertions(+), 29 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 568b96e20d59..830367e36d5a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -783,8 +783,12 @@ config PARAVIRT
 	  over full virtualization.  However, when run without a hypervisor
 	  the kernel is theoretically slower and slightly larger.
 
+config PARAVIRT_HLT
+	bool
+
 config PARAVIRT_FULL
 	bool
+	select PARAVIRT_HLT
 
 config PARAVIRT_DEBUG
 	bool "paravirt-ops debugging"
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 0e5713c1cb86..293f22dbada4 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -9,6 +9,7 @@
  * paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
+#undef CONFIG_PARAVIRT_HLT
 #undef CONFIG_PARAVIRT_FULL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 #undef CONFIG_KASAN
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index a4d7dbc2b034..ae839e74fc34 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -59,27 +59,11 @@ static inline __cpuidle void native_halt(void)
 
 #endif
 
-#ifdef CONFIG_PARAVIRT_FULL
+#ifdef CONFIG_PARAVIRT_HLT
 #include <asm/paravirt.h>
 #else
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
-
-static __always_inline unsigned long arch_local_save_flags(void)
-{
-	return native_save_fl();
-}
-
-static __always_inline void arch_local_irq_disable(void)
-{
-	native_irq_disable();
-}
-
-static __always_inline void arch_local_irq_enable(void)
-{
-	native_irq_enable();
-}
-
 /*
  * Used in the idle loop; sti takes one instruction cycle
  * to complete:
@@ -97,6 +81,26 @@ static inline __cpuidle void halt(void)
 {
 	native_halt();
 }
+#endif /* !__ASSEMBLY__ */
+#endif /* CONFIG_PARAVIRT_HLT */
+
+#ifndef CONFIG_PARAVIRT_FULL
+#ifndef __ASSEMBLY__
+
+static __always_inline unsigned long arch_local_save_flags(void)
+{
+	return native_save_fl();
+}
+
+static __always_inline void arch_local_irq_disable(void)
+{
+	native_irq_disable();
+}
+
+static __always_inline void arch_local_irq_enable(void)
+{
+	native_irq_enable();
+}
 
 /*
  * For spinlocks, etc:
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 02751519b0d9..6bc5c1eab6eb 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,18 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 	PVOP_VCALL1(mmu.exit_mmap, mm);
 }
 
+#ifdef CONFIG_PARAVIRT_HLT
+static inline void arch_safe_halt(void)
+{
+	PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+	PVOP_VCALL0(irq.halt);
+}
+#endif
+
 #ifdef CONFIG_PARAVIRT_FULL
 static inline void load_sp0(unsigned long sp0)
 {
@@ -145,16 +157,6 @@ static inline void __write_cr4(unsigned long x)
 	PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-static inline void arch_safe_halt(void)
-{
-	PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
-	PVOP_VCALL0(irq.halt);
-}
-
 static inline void wbinvd(void)
 {
 	PVOP_VCALL0(cpu.wbinvd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index ae3503b2e8a2..cac32ffd95ca 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -177,7 +177,8 @@ struct pv_irq_ops {
 	struct paravirt_callee_save save_fl;
 	struct paravirt_callee_save irq_disable;
 	struct paravirt_callee_save irq_enable;
-
+#endif
+#ifdef CONFIG_PARAVIRT_HLT
 	void (*safe_halt)(void);
 	void (*halt)(void);
 #endif
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index e3a5f0cf9340..752fc3ab81bf 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -322,9 +322,11 @@ struct paravirt_patch_template pv_ops = {
 	.irq.save_fl		= __PV_IS_CALLEE_SAVE(native_save_fl),
 	.irq.irq_disable	= __PV_IS_CALLEE_SAVE(native_irq_disable),
 	.irq.irq_enable		= __PV_IS_CALLEE_SAVE(native_irq_enable),
+#endif /* CONFIG_PARAVIRT_FULL */
+#ifdef CONFIG_PARAVIRT_HLT
 	.irq.safe_halt		= native_safe_halt,
 	.irq.halt		= native_halt,
-#endif /* CONFIG_PARAVIRT_FULL */
+#endif /* CONFIG_PARAVIRT_HLT */
 
 	/* Mmu ops. */
 	.mmu.flush_tlb_user	= native_flush_tlb_local,
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 53fe895cefe2..7cb9b70edbe7 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -24,6 +24,7 @@
  * be extended when new paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
+#undef CONFIG_PARAVIRT_HLT
 #undef CONFIG_PARAVIRT_FULL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2 07/32] x86/traps: Add do_general_protection() helper function
  2021-04-26 18:01 ` [RFC v2 07/32] x86/traps: Add do_general_protection() helper function Kuppuswamy Sathyanarayanan
@ 2021-05-07 21:20   ` Dave Hansen
  0 siblings, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 21:20 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> TDX guest #VE exception handler treats unsupported exceptions

 ^ The

> as #GP. So to handle the #GP, move the protection fault handler

s/So to/To/

Also, it does not "treat them as #GP".  It handles them in the same way
that a #GP is handled.  There's a difference between literally making
them a #GP and having a similar end result.  This description conflates
them.

> code to out of exc_general_protection() and create new helper
> function for it.

I wouldn't name the functions.  Just say that you want the #GP behavior
from #VE so you need a common helper.

> Also since exception handler is responsible to decide when to

	    ^ an

> turn on/off IRQ, move cond_local_irq_{enable/disable)() calls
> out of do_general_protection().

This paragraph doesn't really say anything meaningful.  Yes, exception
handlers reenable interrupts.  Try to *SAY* something about why they do
this and why you have to move the code around.  Or, just axe it.

> This is a preparatory patch for adding #VE exception handler
> support for TDX guests.
> 
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/kernel/traps.c | 51 ++++++++++++++++++++++-------------------
>  1 file changed, 27 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 651e3e508959..213d4aa8e337 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -527,44 +527,28 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
>  
>  #define GPFSTR "general protection fault"
>  
> -DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> +static void do_general_protection(struct pt_regs *regs, long error_code)
>  {
>  	char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
>  	enum kernel_gp_hint hint = GP_NO_HINT;
> -	struct task_struct *tsk;
> +	struct task_struct *tsk = current;
>  	unsigned long gp_addr;
>  	int ret;
>  
> -	cond_local_irq_enable(regs);
> -
> -	if (static_cpu_has(X86_FEATURE_UMIP)) {
> -		if (user_mode(regs) && fixup_umip_exception(regs))
> -			goto exit;
> -	}
> -
> -	if (v8086_mode(regs)) {
> -		local_irq_enable();
> -		handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
> -		local_irq_disable();
> -		return;
> -	}
> -
> -	tsk = current;
> -
>  	if (user_mode(regs)) {
>  		tsk->thread.error_code = error_code;
>  		tsk->thread.trap_nr = X86_TRAP_GP;
>  
>  		if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
> -			goto exit;
> +			return;
>  
>  		show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
>  		force_sig(SIGSEGV);
> -		goto exit;
> +		return;
>  	}
>  
>  	if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
> -		goto exit;
> +		return;
>  
>  	tsk->thread.error_code = error_code;
>  	tsk->thread.trap_nr = X86_TRAP_GP;
> @@ -576,11 +560,11 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>  	if (!preemptible() &&
>  	    kprobe_running() &&
>  	    kprobe_fault_handler(regs, X86_TRAP_GP))
> -		goto exit;
> +		return;
>  
>  	ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);

So... We're going to send signals based on #VE which use this bit in the
ABI which is documented as:

#define X86_TRAP_GP             13      /* General Protection Fault */

Considering that there is also a:

#define X86_TRAP_VE             20      /* Virtualization Exception */

this seems like a stretch.

Also, isn't there a lot of truly #GP-specific code in there, like

fixup_exception()?  Why do you need to call that for #VE?  How did you
decide what remains in the handler versus what gets separated out?

>  	if (ret == NOTIFY_STOP)
> -		goto exit;
> +		return;
>  
>  	if (error_code)
>  		snprintf(desc, sizeof(desc), "segment-related " GPFSTR);
> @@ -601,8 +585,27 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>  		gp_addr = 0;
>  
>  	die_addr(desc, regs, error_code, gp_addr);
> +}
>  
> -exit:
> +DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> +{
> +	cond_local_irq_enable(regs);
> +
> +	if (static_cpu_has(X86_FEATURE_UMIP)) {
> +		if (user_mode(regs) && fixup_umip_exception(regs)) {
> +			cond_local_irq_disable(regs);
> +			return;
> +		}
> +	}
> +
> +	if (v8086_mode(regs)) {
> +		local_irq_enable();
> +		handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
> +		local_irq_disable();
> +		return;
> +	}
> +
> +	do_general_protection(regs, error_code);
>  	cond_local_irq_disable(regs);
>  }



* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-04-26 18:01 ` [RFC v2 08/32] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
@ 2021-05-07 21:36   ` Dave Hansen
  2021-05-13 19:47     ` Andi Kleen
  2021-05-18  0:09     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
  2021-06-08 17:02   ` [RFC v2 08/32] " Dave Hansen
  1 sibling, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 21:36 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
...
> The #VE cannot be nested before TDGETVEINFO is called, if there is any
> reason for it to nest the TD would shut down. The TDX module guarantees
> that no NMIs (or #MC or similar) can happen in this window. After
> TDGETVEINFO the #VE handler can nest if needed, although we don’t expect
> it to happen normally.

I think this description really needs some work.  Does "The #VE cannot
be nested" mean that "hardware guarantees that #VE will not be
generated", or "the #VE must not be nested"?

What does "the TD would shut down" mean?  I think you mean that instead
of delivering a nested #VE the hardware would actually exit to the host
and TDX would prevent the guest from being reentered.  Right?

> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index 5eb3bdf36a41..41a0732d5f68 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -619,6 +619,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER,	exc_xen_hypervisor_callback);
>  DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER,	exc_xen_unknown_trap);
>  #endif
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +DECLARE_IDTENTRY(X86_TRAP_VE,		exc_virtualization_exception);
> +#endif
> +
>  /* Device interrupts common/spurious */
>  DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);
>  #ifdef CONFIG_X86_LOCAL_APIC
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index c5a870cef0ae..1ca55d8e9963 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -11,6 +11,7 @@
>  #include <linux/types.h>
>  
>  #define TDINFO			1
> +#define TDGETVEINFO		3
>  
>  struct tdcall_output {
>  	u64 rcx;
> @@ -29,6 +30,20 @@ struct tdvmcall_output {
>  	u64 r15;
>  };
>  
> +struct ve_info {
> +	u64 exit_reason;
> +	u64 exit_qual;
> +	u64 gla;
> +	u64 gpa;
> +	u32 instr_len;
> +	u32 instr_info;
> +};

Is this an architectural structure or some software construct?

> +unsigned long tdg_get_ve_info(struct ve_info *ve);
> +
> +int tdg_handle_virtualization_exception(struct pt_regs *regs,
> +		struct ve_info *ve);
> +
>  /* Common API to check TDX support in decompression and common kernel code. */
>  bool is_tdx_guest(void);
>  
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index ee1a283f8e96..546b6b636c7d 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
>  	 */
>  	INTG(X86_TRAP_PF,		asm_exc_page_fault),
>  #endif
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
> +#endif
>  };
>  
>  /*
> @@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
>  	INTG(X86_TRAP_MF,		asm_exc_coprocessor_error),
>  	INTG(X86_TRAP_AC,		asm_exc_alignment_check),
>  	INTG(X86_TRAP_XF,		asm_exc_simd_coprocessor_error),
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
> +#endif
>  
>  #ifdef CONFIG_X86_32
>  	TSKG(X86_TRAP_DF,		GDT_ENTRY_DOUBLEFAULT_TSS),
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index b63275db1db9..ccfcb07bfb2c 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -82,6 +82,44 @@ static void tdg_get_info(void)
>  	td_info.attributes = out.rdx;
>  }
>  
> +unsigned long tdg_get_ve_info(struct ve_info *ve)
> +{
> +	u64 ret;
> +	struct tdcall_output out = {0};
> +
> +	/*
> +	 * The #VE cannot be nested before TDGETVEINFO is called,
> +	 * if there is any reason for it to nest the TD would shut
> +	 * down. The TDX module guarantees that no NMIs (or #MC or
> +	 * similar) can happen in this window. After TDGETVEINFO
> +	 * the #VE handler can nest if needed, although we don’t
> +	 * expect it to happen normally.
> +	 */

I find that description a bit unsatisfying.  Could we make this a bit
more concrete?  By the way, what about *normal* interrupts?

Maybe we should talk about this in terms of *rules* that folks need to
follow.  Maybe:

	NMIs and machine checks are suppressed.  Before this point any
	#VE is fatal.  After this point, NMIs and additional #VEs are
	permitted.

> +	ret = __tdcall(TDGETVEINFO, 0, 0, 0, 0, &out);
> +
> +	ve->exit_reason = out.rcx;
> +	ve->exit_qual   = out.rdx;
> +	ve->gla         = out.r8;
> +	ve->gpa         = out.r9;
> +	ve->instr_len   = out.r10 & UINT_MAX;
> +	ve->instr_info  = out.r10 >> 32;
> +
> +	return ret;
> +}
> +
> +int tdg_handle_virtualization_exception(struct pt_regs *regs,
> +		struct ve_info *ve)
> +{
> +	/*
> +	 * TODO: Add handler support for various #VE exit
> +	 * reasons. It will be added by other patches in
> +	 * the series.
> +	 */
> +	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> +	return -EFAULT;
> +}
> +
>  void __init tdx_early_init(void)
>  {
>  	if (!cpuid_has_tdx_guest())
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 213d4aa8e337..64869aa88a5a 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -61,6 +61,7 @@
>  #include <asm/insn.h>
>  #include <asm/insn-eval.h>
>  #include <asm/vdso.h>
> +#include <asm/tdx.h>
>  
>  #ifdef CONFIG_X86_64
>  #include <asm/x86_init.h>
> @@ -1140,6 +1141,35 @@ DEFINE_IDTENTRY(exc_device_not_available)
>  	}
>  }
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +DEFINE_IDTENTRY(exc_virtualization_exception)
> +{
> +	struct ve_info ve;
> +	int ret;
> +
> +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> +
> +	/*
> +	 * Consume #VE info before re-enabling interrupts. It will be
> +	 * re-enabled after executing the TDGETVEINFO TDCALL.
> +	 */

"It" is nebulous here.  Is this talking about NMIs, or the
cond_local_irq_enable() that is "after" TDGETVEINFO?

> +	ret = tdg_get_ve_info(&ve);
> +
> +	cond_local_irq_enable(regs);
> +
> +	if (!ret)
> +		ret = tdg_handle_virtualization_exception(regs, &ve);
> +	/*
> +	 * If tdg_handle_virtualization_exception() could not process
> +	 * it successfully, treat it as #GP(0) and handle it.
> +	 */
> +	if (ret)
> +		do_general_protection(regs, 0);
> +
> +	cond_local_irq_disable(regs);
> +}
> +#endif
> +
>  #ifdef CONFIG_X86_32
>  DEFINE_IDTENTRY_SW(iret_error)
>  {
> 


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-04-26 18:01 ` [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls Kuppuswamy Sathyanarayanan
@ 2021-05-07 21:46   ` Dave Hansen
  2021-05-08  0:59     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 21:46 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> KVM hypercalls have to be wrapped into vendor-specific TDVMCALLs.

How about:

KVM hypercalls use the "vmcall" or "vmmcall" instructions.  Although the
ABI is similar, those instructions no longer function for TDX guests.
Make TDVMCALLs instead of VMCALL/VMMCALL.

> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 338119852512..2fa85481520b 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -6,6 +6,7 @@
>  #include <asm/alternative.h>
>  #include <linux/interrupt.h>
>  #include <uapi/asm/kvm_para.h>
> +#include <asm/tdx.h>
>  
>  extern void kvmclock_init(void);
>  
> @@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
>  static inline long kvm_hypercall0(unsigned int nr)
>  {
>  	long ret;
> +
> +	if (is_tdx_guest())
> +		return tdx_kvm_hypercall0(nr);

... all of these look OK.

>  #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> index 81af70c2acbd..964bfd7fc682 100644
> --- a/arch/x86/kernel/tdcall.S
> +++ b/arch/x86/kernel/tdcall.S
> @@ -11,6 +11,7 @@
>   * refer to TDX GHCI specification).
>   */
>  #define TDVMCALL_EXPOSE_REGS_MASK	0xfc00
> +#define TDVMCALL_VENDOR_KVM		0x4d564b2e584454 /* "TDX.KVM" */
>  
>  /*
>   * TDX guests use the TDCALL instruction to make
> @@ -198,3 +199,9 @@ SYM_FUNC_START(__tdvmcall)
>  	call do_tdvmcall
>  	retq
>  SYM_FUNC_END(__tdvmcall)
> +
> +SYM_FUNC_START(__tdvmcall_vendor_kvm)
> +	movq $TDVMCALL_VENDOR_KVM, %r10
> +	call do_tdvmcall
> +	retq
> +SYM_FUNC_END(__tdvmcall_vendor_kvm)

Granted, this is not a ton of assembly.  But, it does look a bit weird.
It needs a comment and/or a mention in the changelog.

R10 is not part of the function call ABI, but it is part of the
TDVMCALL ABI.  This little assembly wrapper lets us reuse do_tdvmcall()
for both the KVM-specific TDVMCALL_VENDOR_KVM hypercalls and the more
generic __tdvmcalls.

> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -8,6 +8,10 @@
>  
>  #include <linux/cpu.h>
>  
> +#ifdef CONFIG_KVM_GUEST
> +#include "tdx-kvm.c"
> +#endif
> +
>  static struct {
>  	unsigned int gpa_width;
>  	unsigned long attributes;

I know KVM does weird stuff.  But, this is *really* weird.  Why are we
#including a .c file into another .c file?


* Re: [RFC v2 15/32] x86/tdx: Handle in-kernel MMIO
  2021-04-26 18:01 ` [RFC v2 15/32] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
@ 2021-05-07 21:52   ` Dave Hansen
  2021-05-18  0:48     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
  2021-06-02 19:42     ` [RFC v2-fix-v2 0/2] " Kuppuswamy Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 21:52 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
> exit reason.

This needs a bit of a history lesson.  "In traditional VMs, MMIO tends
to be implemented by giving a guest access to a mapping which will
cause a VMEXIT on access.  That's not possible in a TDX guest..."

> For now we only handle subset of instruction that kernel uses for MMIO
> oerations. User-space access triggers SIGBUS.

I still don't think that TDX guests should be doing things that they
*KNOW* will cause #VE, including MMIO.  I really want to hear a more
discrete story about why this is the *best* way to do this for Linux
instead of just a hack from the Windows binary driver ecosystem that
seemed expedient.


* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-04-26 18:01 ` [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code Kuppuswamy Sathyanarayanan
@ 2021-05-07 21:54   ` Dave Hansen
  2021-05-10 22:19     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 21:54 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

> +++ b/arch/x86/mm/mem_encrypt_common.c
...
> +/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
> +bool force_dma_unencrypted(struct device *dev)
> +{
> +	/*
> +	 * For SEV, all DMA must be to unencrypted/shared addresses.
> +	 */
> +	if (sev_active())
> +		return true;
> +
> +	/*
> +	 * For SME, all DMA must be to unencrypted addresses if the
> +	 * device does not support DMA to addresses that include the
> +	 * encryption mask.
> +	 */
> +	if (sme_active()) {
> +		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
> +		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
> +						dev->bus_dma_limit);
> +
> +		if (dma_dev_mask <= dma_enc_mask)
> +			return true;
> +	}
> +
> +	return false;
> +}

This doesn't seem much like common code to me.  It seems like 100% SEV
code.  Is this really where we want to move it?


* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-04-26 18:01 ` [RFC v2 28/32] x86/tdx: Make pages shared in ioremap() Kuppuswamy Sathyanarayanan
@ 2021-05-07 21:55   ` Dave Hansen
  2021-05-07 22:38     ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 21:55 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>  static unsigned int __ioremap_check_encrypted(struct resource *res)
>  {
> -	if (!sev_active())
> +	if (!sev_active() && !is_tdx_guest())
>  		return 0;

I think it's time to come up with a real name for all of the code that's
under: (sev_active() || is_tdx_guest()).

"encrypted" isn't it, for sure.


* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-07 21:55   ` Dave Hansen
@ 2021-05-07 22:38     ` Andi Kleen
  2021-05-10 22:23       ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-07 22:38 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel


On 5/7/2021 2:55 PM, Dave Hansen wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>>   static unsigned int __ioremap_check_encrypted(struct resource *res)
>>   {
>> -	if (!sev_active())
>> +	if (!sev_active() && !is_tdx_guest())
>>   		return 0;
> I think it's time to come up with a real name for all of the code that's
> under: (sev_active() || is_tdx_guest()).
>
> "encrypted" isn't it, for sure.

I called it protected_guest() in some other patches.

-Andi



* Re: [RFC v2 32/32] x86/tdx: ioapic: Add shared bit for IOAPIC base address
  2021-04-26 18:01 ` [RFC v2 32/32] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kuppuswamy Sathyanarayanan
@ 2021-05-07 23:06   ` Dave Hansen
  2021-05-24 23:29     ` [RFC v2-fix-v2 1/1] " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 23:06 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> IOAPIC is emulated by KVM which means its MMIO address is shared
> by host. Add shared bit for base address of IOAPIC.
> Most MMIO region is handled by ioremap which is already marked
> as shared for TDX guest platform, but IOAPIC is an exception which
> uses fixed map.

Ho hum...  I guess I'll rewrite the changelog:

The kernel interacts with each bare-metal IOAPIC with a special MMIO
page.  When running under KVM, the guest's IOAPICs are emulated by KVM.

When running as a TDX guest, the guest needs to mark each IOAPIC mapping
as "shared" with the host.  This ensures that TDX private protections
are not applied to the page, which allows the TDX host emulation to work.

Earlier patches in this series modified ioremap() so that
ioremap()-created mappings such as virtio will be marked as shared.
However, the IOAPIC code does not use ioremap() and instead uses the
fixmap mechanism.

Introduce a special fixmap helper just for the IOAPIC code.  Ensure that
it marks IOAPIC pages as "shared".  This replaces set_fixmap_nocache()
with __set_fixmap() since __set_fixmap() allows custom 'prot' values.

>  arch/x86/kernel/apic/io_apic.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index 73ff4dd426a8..2a01d4a82be7 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2675,6 +2675,14 @@ static struct resource * __init ioapic_setup_resources(void)
>  	return res;
>  }
>  
> +static void io_apic_set_fixmap_nocache(enum fixed_addresses idx, phys_addr_t phys)
> +{
> +	pgprot_t flags = FIXMAP_PAGE_NOCACHE;
> +	if (is_tdx_guest())
> +		flags = pgprot_tdg_shared(flags);
> +	__set_fixmap(idx, phys, flags);
> +}

^ This seems like it could at least use a one-liner comment.

>  void __init io_apic_init_mappings(void)
>  {
>  	unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
> @@ -2707,7 +2715,7 @@ void __init io_apic_init_mappings(void)
>  				      __func__, PAGE_SIZE, PAGE_SIZE);
>  			ioapic_phys = __pa(ioapic_phys);
>  		}
> -		set_fixmap_nocache(idx, ioapic_phys);
> +		io_apic_set_fixmap_nocache(idx, ioapic_phys);
>  		apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
>  			__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
>  			ioapic_phys);
> @@ -2836,7 +2844,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
>  	ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
>  	ioapics[idx].mp_config.apicaddr = address;
>  
> -	set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
> +	io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
>  	if (bad_ioapic_register(idx)) {
>  		clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
>  		return -ENODEV;
> 



* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-05-07 21:46   ` Dave Hansen
@ 2021-05-08  0:59     ` Kuppuswamy, Sathyanarayanan
  2021-05-12 13:00       ` Kirill A. Shutemov
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-08  0:59 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata



On 5/7/21 2:46 PM, Dave Hansen wrote:
> I know KVM does weird stuff.  But, this is*really*  weird.  Why are we
> #including a .c file into another .c file?

I think Kirill implemented it this way to skip Makefile changes for it. I don't
see any other KVM direct dependencies in tdx.c.

I will fix it in next version.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-04-27 17:31   ` Borislav Petkov
  2021-05-06 14:59     ` Kirill A. Shutemov
@ 2021-05-10  8:07     ` Juergen Gross
  2021-05-10 15:52       ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Juergen Gross @ 2021-05-10  8:07 UTC (permalink / raw)
  To: Borislav Petkov, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel


On 27.04.21 19:31, Borislav Petkov wrote:
> + Jürgen.
> 
> On Mon, Apr 26, 2021 at 11:01:28AM -0700, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> Split off halt paravirt calls from CONFIG_PARAVIRT_XXL into
>> a separate config option. It provides a middle ground for
>> not-so-deep paravirtulized environments.
> 
> Please introduce a spellchecker into your patch creation workflow.
> 
> Also, what does "not-so-deep" mean?
> 
>> CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
>> calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
>> config would be a bloat for TDX.
> 
> Used how? Why is it bloat for TDX?

Is there any major downside to move the halt related pvops functions
from CONFIG_PARAVIRT_XXL to CONFIG_PARAVIRT?

I'd rather introduce a new PARAVIRT level only in case of multiple
pvops functions needed for a new guest type, or if a real hot path
would be affected.


Juergen



* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-05-10  8:07     ` Juergen Gross
@ 2021-05-10 15:52       ` Andi Kleen
  2021-05-10 15:56         ` Juergen Gross
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-10 15:52 UTC (permalink / raw)
  To: Juergen Gross, Borislav Petkov, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

>>> CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
>>> calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
>>> config would be a bloat for TDX.
>>
>> Used how? Why is it bloat for TDX?
>
> Is there any major downside to move the halt related pvops functions
> from CONFIG_PARAVIRT_XXL to CONFIG_PARAVIRT?

I think the main motivation is to get rid of all the page table related
hooks for modern configurations. These are the bulk of the annotations
and cause bloat and worse code. Shadow page tables are really obscure
these days and very few people still need them, and it's totally
reasonable to build even widely used distribution kernels without them.
In contrast, most of the other hooks are comparatively few and also on
comparatively slow paths, so they don't really matter too much.

I think it would be ok to have a CONFIG_PARAVIRT that does not have page 
table support, and a separate config option for those (that could be 
eventually deprecated).

But that would break existing .configs for those shadow page table
users; that's why I think Kirill did it the other way around.

-Andi




* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-05-10 15:52       ` Andi Kleen
@ 2021-05-10 15:56         ` Juergen Gross
  2021-05-12 12:07           ` Kirill A. Shutemov
  2021-05-12 13:18           ` Peter Zijlstra
  0 siblings, 2 replies; 381+ messages in thread
From: Juergen Gross @ 2021-05-10 15:56 UTC (permalink / raw)
  To: Andi Kleen, Borislav Petkov, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel


On 10.05.21 17:52, Andi Kleen wrote:
> \
>>>> CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
>>>> calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
>>>> config would be a bloat for TDX.
>>>
>>> Used how? Why is it bloat for TDX?
>>
>> Is there any major downside to move the halt related pvops functions
>> from CONFIG_PARAVIRT_XXL to CONFIG_PARAVIRT?
> 
> I think the main motivation is to get rid of all the page table related 
> hooks for modern configurations. These are the bulk of the annotations 
> and  cause bloat and worse code. Shadow page tables are really obscure 
> these days and very few people still need them and it's totally 
> reasonable to build even widely used distribution kernels without them. 
> On contrast most of the other hooks are comparatively few and also on 
> comparatively slow paths, so don't really matter too much.
> 
> I think it would be ok to have a CONFIG_PARAVIRT that does not have page 
> table support, and a separate config option for those (that could be 
> eventually deprecated).
> 
> But that would break existing .configs for those shadow stack users, 
> that's why I think Kirill did it the other way around.

No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
other hypervisor's guests, supporting basically the TLB flush operations
and time related operations only. Adding the halt related operations to
PARAVIRT wouldn't break anything.


Juergen




* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-04-26 18:01 ` [RFC v2 14/32] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
@ 2021-05-10 21:57   ` Dan Williams
  2021-05-10 23:08     ` Andi Kleen
  2021-05-11 15:35     ` Dave Hansen
  0 siblings, 2 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-10 21:57 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Mon, Apr 26, 2021 at 11:02 AM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

While I do not expect that a patch in the middle of a series needs the
full introduction of all concepts, the amount of context this changelog
expects of the reader makes the patch actively painful to read.

Some connective tissue commentary to list assumptions and pointers to
definitions is needed to make this patch stand alone when a future
bisect lands on it and someone wonders where to get started debugging
it.

> Unroll string operations and handle port I/O through TDVMCALLs.
> Also handle #VE due to I/O operations with the same TDVMCALLs.

There is a mix of direct-TDVMCALL usage and #VE handling; when and why
is either approach used?

> Decompression code uses port IO for earlyprintk. We must use
> paravirt calls there too if we want to allow earlyprintk.

What is the tradeoff between teaching the decompression code to handle
#VE (the implied assumption) vs teaching it to avoid #VE with direct
TDVMCALLs (the chosen direction)?

Rewrite without "we":

"Given the need to support earlyprintk for protected guests, deploy
paravirt calls for the io*() and out*() usage in the decompress code."

This raises the question of why the cover letter switched from
explicitly saying TDVMCALL to "paravirt" where it could be confused
with the typical paravirt helpers?

>
> Decompresion code cannot deal with alternatives: use branches

s/Decompresion/Decompression/

> instead to implement inX() and outX() helpers.
>
> Since we use call instruction in place of in/out instruction,
> the argument passed to call instruction has to be in a
> register, it cannot be an immediate value like in/out
> instruction. So change constraint flag from "Nd" to "d"

Rewrite without "we":

With the approach to use a "call" instruction as an alternative for an
"in/out" instruction it is no longer the case that the argument can be
an immediate value. Change the asm constraint flag from "Nd" to "d" to
accommodate.

>
> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/boot/compressed/Makefile |   1 +
>  arch/x86/boot/compressed/tdcall.S |   9 ++
>  arch/x86/include/asm/io.h         |   5 +-
>  arch/x86/include/asm/tdx.h        |  46 ++++++++-
>  arch/x86/kernel/tdcall.S          | 154 ++++++++++++++++++++++++++++++

Why is this named "tdcall" when it is implementing tdvmcalls? I must
say those names don't really help me understand what they do. Can we
have Linux names that don't mandate keeping the spec terminology in my
brain's translation cache?

>  arch/x86/kernel/tdx.c             |  33 +++++++
>  6 files changed, 245 insertions(+), 3 deletions(-)
>  create mode 100644 arch/x86/boot/compressed/tdcall.S
>
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index a2554621cefe..a944a2038797 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -97,6 +97,7 @@ endif
>
>  vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
>  vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
> +vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o
>
>  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
>  efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
> diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
> new file mode 100644
> index 000000000000..5ebb80d45ad8
> --- /dev/null
> +++ b/arch/x86/boot/compressed/tdcall.S
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include <asm/export.h>
> +
> +/* Do not export symbols in decompression code */
> +#undef EXPORT_SYMBOL
> +#define EXPORT_SYMBOL(sym)

What's wrong with the existing:

KBUILD_CFLAGS += -D__DISABLE_EXPORTS

...in arch/x86/boot/compressed/Makefile?

> +
> +#include "../../kernel/tdcall.S"
> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> index ef7a686a55a9..30a3b30395ad 100644
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -43,6 +43,7 @@
>  #include <asm/page.h>
>  #include <asm/early_ioremap.h>
>  #include <asm/pgtable_types.h>
> +#include <asm/tdx.h>
>
>  #define build_mmio_read(name, size, type, reg, barrier) \
>  static inline type name(const volatile void __iomem *addr) \
> @@ -309,7 +310,7 @@ static inline unsigned type in##bwl##_p(int port)                   \
>                                                                         \
>  static inline void outs##bwl(int port, const void *addr, unsigned long count) \
>  {                                                                      \
> -       if (sev_key_active()) {                                         \
> +       if (sev_key_active() || is_tdx_guest()) {                       \

Is there a unified Linux name these can be given to stop the
proliferation of poor vendor names for similar concepts?

>                 unsigned type *value = (unsigned type *)addr;           \
>                 while (count) {                                         \
>                         out##bwl(*value, port);                         \
> @@ -325,7 +326,7 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
>                                                                         \
>  static inline void ins##bwl(int port, void *addr, unsigned long count) \
>  {                                                                      \
> -       if (sev_key_active()) {                                         \
> +       if (sev_key_active() || is_tdx_guest()) {                       \
>                 unsigned type *value = (unsigned type *)addr;           \
>                 while (count) {                                         \
>                         *value = in##bwl(port);                         \
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index e0b3ed9e262c..b972c6531a53 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -5,6 +5,8 @@
>
>  #define TDX_CPUID_LEAF_ID      0x21
>
> +#ifndef __ASSEMBLY__
> +
>  #ifdef CONFIG_INTEL_TDX_GUEST
>
>  #include <asm/cpufeature.h>
> @@ -67,6 +69,48 @@ long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
>  long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
>                 unsigned long p3, unsigned long p4);
>
> +/* Decompression code doesn't know how to handle alternatives */

Does it also not know how to handle #VE to keep it aligned with the
runtime code?

> +#ifdef BOOT_COMPRESSED_MISC_H
> +#define __out(bwl, bw)                                                 \
> +do {                                                                   \
> +       if (is_tdx_guest()) {                                           \
> +               asm volatile("call tdg_out" #bwl : :                    \
> +                               "a"(value), "d"(port));                 \
> +       } else {                                                        \
> +               asm volatile("out" #bwl " %" #bw "0, %w1" : :           \
> +                               "a"(value), "Nd"(port));                \
> +       }                                                               \
> +} while (0)
> +#define __in(bwl, bw)                                                  \
> +do {                                                                   \
> +       if (is_tdx_guest()) {                                           \
> +               asm volatile("call tdg_in" #bwl :                       \
> +                               "=a"(value) : "d"(port));               \
> +       } else {                                                        \
> +               asm volatile("in" #bwl " %w1, %" #bw "0" :              \
> +                               "=a"(value) : "Nd"(port));              \
> +       }                                                               \
> +} while (0)
> +#else
> +#define __out(bwl, bw)                                                 \
> +       alternative_input("out" #bwl " %" #bw "1, %w2",                 \
> +                       "call tdg_out" #bwl, X86_FEATURE_TDX_GUEST,     \
> +                       "a"(value), "d"(port))
> +
> +#define __in(bwl, bw)                                                  \
> +       alternative_io("in" #bwl " %w2, %" #bw "0",                     \
> +                       "call tdg_in" #bwl, X86_FEATURE_TDX_GUEST,      \
> +                       "=a"(value), "d"(port))

Outside the boot decompression code isn't this branch of the "ifdef
BOOT_COMPRESSED_MISC_H"  handled by #VE? I also don't see any usage of
__{in,out}() in this patch.

> +#endif
> +
> +void tdg_outb(unsigned char value, unsigned short port);
> +void tdg_outw(unsigned short value, unsigned short port);
> +void tdg_outl(unsigned int value, unsigned short port);
> +
> +unsigned char tdg_inb(unsigned short port);
> +unsigned short tdg_inw(unsigned short port);
> +unsigned int tdg_inl(unsigned short port);
> +
>  #else // !CONFIG_INTEL_TDX_GUEST
>
>  static inline bool is_tdx_guest(void)
> @@ -106,5 +150,5 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
>  }
>
>  #endif /* CONFIG_INTEL_TDX_GUEST */
> -
> +#endif /* __ASSEMBLY__ */
>  #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> index 964bfd7fc682..df4159bb5103 100644
> --- a/arch/x86/kernel/tdcall.S
> +++ b/arch/x86/kernel/tdcall.S
> @@ -3,6 +3,7 @@
>  #include <asm/asm.h>
>  #include <asm/frame.h>
>  #include <asm/unwind_hints.h>
> +#include <asm/export.h>
>
>  #include <linux/linkage.h>
>
> @@ -12,6 +13,12 @@
>   */
>  #define TDVMCALL_EXPOSE_REGS_MASK      0xfc00
>  #define TDVMCALL_VENDOR_KVM            0x4d564b2e584454 /* "TDX.KVM" */
> +#define EXIT_REASON_IO_INSTRUCTION     30
> +/*
> + * Current size of struct tdvmcall_output is 40 bytes,
> + * but allocate double to account future changes.

What future changes? Why could they not be handled as future code changes?

> + */
> +#define TDVMCALL_OUTPUT_SIZE           80

Perhaps "PAYLOAD_SIZE" since it is used for both input and output?

If the ABI does not include the size of the payload then how would
code detect if even 80 bytes was violated in the future?

>
>  /*
>   * TDX guests use the TDCALL instruction to make
> @@ -205,3 +212,150 @@ SYM_FUNC_START(__tdvmcall_vendor_kvm)
>         call do_tdvmcall
>         retq
>  SYM_FUNC_END(__tdvmcall_vendor_kvm)
> +
> +.macro io_save_registers
> +       push %rbp
> +       push %rbx
> +       push %rcx
> +       push %rdx
> +       push %rdi
> +       push %rsi
> +       push %r8
> +       push %r9
> +       push %r10
> +       push %r11
> +       push %r12
> +       push %r13
> +       push %r14
> +       push %r15

Surely there's an existing macro for this pattern? Would
PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
would eliminate clearing of %r8.

> +.endm
> +.macro io_restore_registers
> +       pop %r15
> +       pop %r14
> +       pop %r13
> +       pop %r12
> +       pop %r11
> +       pop %r10
> +       pop %r9
> +       pop %r8
> +       pop %rsi
> +       pop %rdi
> +       pop %rdx
> +       pop %rcx
> +       pop %rbx
> +       pop %rbp
> +.endm
> +
> +/*
> + * tdg_out{b,w,l}()  - Write given data to the specified port.
> + *
> + * @arg1 (RAX)       - Value to be written (passed via R8 to do_tdvmcall()).
> + * @arg2 (RDX)       - Port id (passed via RCX to do_tdvmcall()).
> + *
> + */
> +SYM_FUNC_START(tdg_outb)
> +       io_save_registers
> +       xor %r8, %r8
> +       /* Move data to R8 register */
> +       mov %al, %r8b
> +       /* Set data width to 1 byte */
> +       mov $1, %rsi
> +       jmp 1f
> +
> +SYM_FUNC_START(tdg_outw)
> +       io_save_registers
> +       xor %r8, %r8
> +       /* Move data to R8 register */
> +       mov %ax, %r8w
> +       /* Set data width to 2 bytes */
> +       mov $2, %rsi
> +       jmp 1f
> +
> +SYM_FUNC_START(tdg_outl)
> +       io_save_registers
> +       xor %r8, %r8
> +       /* Move data to R8 register */
> +       mov %eax, %r8d
> +       /* Set data width to 4 bytes */
> +       mov $4, %rsi
> +1:
> +       /*
> +        * Since io_save_registers does not save rax
> +        * state, save it here so that we can preserve
> +        * the caller register state.
> +        */
> +       push %rax
> +
> +       mov %rdx, %rcx
> +       /* Set 1 in RDX to select out operation */
> +       mov $1, %rdx
> +       /* Set TDVMCALL function id in RDI */
> +       mov $EXIT_REASON_IO_INSTRUCTION, %rdi
> +       /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
> +       xor %r10, %r10
> +       /* Since we don't use tdvmcall output, set it to NULL */
> +       xor %r9, %r9
> +
> +       call do_tdvmcall
> +
> +       pop %rax
> +       io_restore_registers
> +       ret
> +SYM_FUNC_END(tdg_outb)
> +SYM_FUNC_END(tdg_outw)
> +SYM_FUNC_END(tdg_outl)
> +EXPORT_SYMBOL(tdg_outb)
> +EXPORT_SYMBOL(tdg_outw)
> +EXPORT_SYMBOL(tdg_outl)
> +
> +/*
> + * tdg_in{b,w,l}()   - Read data from the specified port.
> + *
> + * @arg1 (RDX)       - Port id (passed via RCX to do_tdvmcall()).
> + *
> + * Returns data read via RAX register.
> + *
> + */
> +SYM_FUNC_START(tdg_inb)
> +       io_save_registers
> +       /* Set data width to 1 byte */
> +       mov $1, %rsi
> +       jmp 1f
> +
> +SYM_FUNC_START(tdg_inw)
> +       io_save_registers
> +       /* Set data width to 2 bytes */
> +       mov $2, %rsi
> +       jmp 1f
> +
> +SYM_FUNC_START(tdg_inl)
> +       io_save_registers
> +       /* Set data width to 4 bytes */
> +       mov $4, %rsi
> +1:
> +       mov %rdx, %rcx
> +       /* Set 0 in RDX to select in operation */
> +       mov $0, %rdx
> +       /* Set TDVMCALL function id in RDI */
> +       mov $EXIT_REASON_IO_INSTRUCTION, %rdi
> +       /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
> +       xor %r10, %r10
> +       /* Allocate memory in stack for Output */
> +       subq $TDVMCALL_OUTPUT_SIZE, %rsp

Why is this the leaf function's responsibility? I would expect the core
do_tdvmcall (or whatever it is renamed to) helper to hide output
buffer payload handling. tdg_in* only wants 1, 2, or 4 bytes, not 40
bytes of payload to handle.

> +       /* Move tdvmcall_output pointer to R9 */
> +       movq %rsp, %r9
> +
> +       call do_tdvmcall
> +
> +       /* Move data read from port to RAX */
> +       mov TDVMCALL_r11(%r9), %eax

"TDVMCALL_r11" is unreadable; what is that doing?

Shouldn't failed in* calls signal failure with an all-ones result?

> +       /* Free allocated memory */
> +       addq $TDVMCALL_OUTPUT_SIZE, %rsp
> +       io_restore_registers
> +       ret
> +SYM_FUNC_END(tdg_inb)
> +SYM_FUNC_END(tdg_inw)
> +SYM_FUNC_END(tdg_inl)
> +EXPORT_SYMBOL(tdg_inb)
> +EXPORT_SYMBOL(tdg_inw)
> +EXPORT_SYMBOL(tdg_inl)
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index e42e260df245..ec61f2f06c98 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -189,6 +189,36 @@ static void tdg_handle_cpuid(struct pt_regs *regs)
>         regs->dx = out.r15;
>  }
>
> +static void tdg_out(int size, int port, unsigned int value)
> +{
> +       tdvmcall(EXIT_REASON_IO_INSTRUCTION, size, 1, port, value);
> +}
> +
> +static unsigned int tdg_in(int size, int port)
> +{
> +       return tdvmcall_out_r11(EXIT_REASON_IO_INSTRUCTION, size, 0, port, 0);
> +}
> +
> +static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
> +{
> +       bool string = exit_qual & 16;
> +       int out, size, port;
> +
> +       /* I/O strings ops are unrolled at build time. */
> +       BUG_ON(string);
> +
> +       out = (exit_qual & 8) ? 0 : 1;
> +       size = (exit_qual & 7) + 1;
> +       port = exit_qual >> 16;

This seems to be begging for exit_qual helpers to put symbolic names
on these operations.

> +
> +       if (out) {
> +               tdg_out(size, port, regs->ax);
> +       } else {
> +               regs->ax &= ~GENMASK(8 * size - 1, 0);
> +               regs->ax |= tdg_in(size, port) & GENMASK(8 * size - 1, 0);
> +       }
> +}
> +
>  unsigned long tdg_get_ve_info(struct ve_info *ve)
>  {
>         u64 ret;
> @@ -238,6 +268,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>         case EXIT_REASON_CPUID:
>                 tdg_handle_cpuid(regs);
>                 break;
> +       case EXIT_REASON_IO_INSTRUCTION:
> +               tdg_handle_io(regs, ve->exit_qual);
> +               break;
>         default:
>                 pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>                 return -EFAULT;
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-07 21:54   ` Dave Hansen
@ 2021-05-10 22:19     ` Kuppuswamy, Sathyanarayanan
  2021-05-10 22:23       ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-10 22:19 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 5/7/21 2:54 PM, Dave Hansen wrote:
> This doesn't seem much like common code to me.  It seems like 100% SEV
> code.  Is this really where we want to move it?

Both the SEV and TDX code need to enable
CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED and define the
force_dma_unencrypted() function.

force_dma_unencrypted() is modified by the patch titled "x86/tdx: Make DMA
pages shared" to add TDX guest specific support.

Since both the SEV and TDX code use it, it is moved to a common file.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-07 22:38     ` Andi Kleen
@ 2021-05-10 22:23       ` Kuppuswamy, Sathyanarayanan
  2021-05-10 22:30         ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-10 22:23 UTC (permalink / raw)
  To: Andi Kleen, Dave Hansen, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel

Hi Dave,

On 5/7/21 3:38 PM, Andi Kleen wrote:
> 
> On 5/7/2021 2:55 PM, Dave Hansen wrote:
>> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>>>   static unsigned int __ioremap_check_encrypted(struct resource *res)
>>>   {
>>> -    if (!sev_active())
>>> +    if (!sev_active() && !is_tdx_guest())
>>>           return 0;
>> I think it's time to come up with a real name for all of the code that's
>> under: (sev_active() || is_tdx_guest()).
>>
>> "encrypted" isn't it, for sure.
> 
> I called it protected_guest() in some other patches.

If you are also fine with the above-mentioned function name, I can include it
in this series. Since we have many use cases of this condition, it will
be useful to define it as a helper function.

> 
> -Andi
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-10 22:19     ` Kuppuswamy, Sathyanarayanan
@ 2021-05-10 22:23       ` Dave Hansen
  2021-05-12 13:08         ` Kirill A. Shutemov
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-10 22:23 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 5/10/21 3:19 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 5/7/21 2:54 PM, Dave Hansen wrote:
>> This doesn't seem much like common code to me.  It seems like 100% SEV
>> code.  Is this really where we want to move it?
> 
> Both the SEV and TDX code need to enable
> CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED and define the
> force_dma_unencrypted() function.
> 
> force_dma_unencrypted() is modified by the patch titled "x86/tdx: Make DMA
> pages shared" to add TDX guest specific support.
> 
> Since both the SEV and TDX code use it, it is moved to a common file.

That's not an excuse to have a bunch of AMD (or Intel) feature-specific
code in a file named "common".  I'd make an attempt to keep them
separate and then call into the two separate functions *from* the common
function.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-10 22:23       ` Kuppuswamy, Sathyanarayanan
@ 2021-05-10 22:30         ` Dave Hansen
  2021-05-10 22:52           ` Sean Christopherson
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-10 22:30 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Andi Kleen, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel

On 5/10/21 3:23 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>>
>>>> -    if (!sev_active())
>>>> +    if (!sev_active() && !is_tdx_guest())
>>>>           return 0;
>>> I think it's time to come up with a real name for all of the code that's
>>> under: (sev_active() || is_tdx_guest()).
>>>
>>> "encrypted" isn't it, for sure.
>>
>> I called it protected_guest() in some other patches.
> 
> If you are also fine with the above-mentioned function name, I can include it
> in this series. Since we have many use cases of this condition, it will
> be useful to define it as a helper function.

FWIW, I think sev_active() has a horrible name.  Shouldn't that be
"is_sev_guest()"?  "sev_active()" could be read as "I'm a SEV host" or
"I'm a SEV guest" and "SEV is active".

protected_guest() seems fine to cover both, despite the horrid SEV
naming.  It'll actually be nice to banish it from appearing in many of
its uses. :)


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-10 22:30         ` Dave Hansen
@ 2021-05-10 22:52           ` Sean Christopherson
  2021-05-11  9:35             ` Borislav Petkov
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-10 22:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy, Sathyanarayanan, Andi Kleen, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel,
	Borislav Petkov

+Boris, who has similar opinions on sev_active().

On Mon, May 10, 2021, Dave Hansen wrote:
> On 5/10/21 3:23 PM, Kuppuswamy, Sathyanarayanan wrote:
> >>>>
> >>>> -    if (!sev_active())
> >>>> +    if (!sev_active() && !is_tdx_guest())
> >>>>           return 0;
> >>> I think it's time to come up with a real name for all of the code that's
> >>> under: (sev_active() || is_tdx_guest()).
> >>>
> >>> "encrypted" isn't it, for sure.
> >>
> >> I called it protected_guest() in some other patches.
> > 
> > If you are also fine with the above-mentioned function name, I can include it
> > in this series. Since we have many use cases of this condition, it will
> > be useful to define it as a helper function.
> 
> FWIW, I think sev_active() has a horrible name.  Shouldn't that be
> "is_sev_guest()"?  "sev_active()" could be read as "I'm a SEV host" or
> "I'm a SEV guest" and "SEV is active".

I can't find the thread offhand, but Boris proposed something along the lines of
cpu_has(), but specific to a given flavor of protected guest.  IIRC, it was
sev_guest_has(SEV_ES) or something like that.

I 100% agree that we should have actual feature bits somewhere for the various
protected guest flavors.

> protected_guest() seems fine to cover both, despite the horrid SEV
> naming.  It'll actually be nice to banish it from appearing in many of
> its uses. :)

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-10 21:57   ` Dan Williams
@ 2021-05-10 23:08     ` Andi Kleen
  2021-05-10 23:34       ` Dan Williams
  2021-05-11 15:35     ` Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-10 23:08 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List


On 5/10/2021 2:57 PM, Dan Williams wrote:
>
> There is a mix of direct-TDVMCALL usage and handling #VE; when and why
> is either approach used?

For the really early code in the decompressor or the main kernel we
can't use #VE because the IDT needed for handling the exception is not
set up, and some other infrastructure needed by the handler is missing.
The early code needs to do port IO to be able to write to the early serial
console. To keep it all common, it ended up that all port IO is paravirt.
Actually, for most of the main kernel's port IO calls we could just use #VE,
and it would result in smaller binaries, but then we would need to
annotate all early port IO with some special name. That's why port IO is
all TDCALL.

Other than that, the only thing that really has to be #VE is MMIO, because
we don't want to annotate every MMIO read*/write* with an alternative
(which would result in incredible binary bloat). The others have now
mostly become direct calls.


>
>> Decompression code uses port IO for earlyprintk. We must use
>> paravirt calls there too if we want to allow earlyprintk.
> What is the tradeoff between teaching the decompression code to handle
> #VE (the implied assumption) vs teaching it to avoid #VE with direct
> TDVMCALLs (the chosen direction)?

The decompression code only really needs it to output something. But you 
couldn't debug anything until #VE is set up. Also the decompression code 
has a very basic environment that doesn't supply most kernel services, 
and the #VE handler is relatively complicated. It would probably need to 
be duplicated and the instruction decoder be ported to work in this 
environment. It would be all a lot of work, just to make the debug 
output work.

>
>> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>> ---
>>   arch/x86/boot/compressed/Makefile |   1 +
>>   arch/x86/boot/compressed/tdcall.S |   9 ++
>>   arch/x86/include/asm/io.h         |   5 +-
>>   arch/x86/include/asm/tdx.h        |  46 ++++++++-
>>   arch/x86/kernel/tdcall.S          | 154 ++++++++++++++++++++++++++++++
> Why is this named "tdcall" when it is implementing tdvmcalls? I must
> say those names don't really help me understand what they do. Can we
> have Linux names that don't mandate keeping the spec terminology in my
> brain's translation cache?

The instruction is called TDCALL. It's always the same instruction.

TDVMCALL is the variant where the host processes it (as opposed to the 
TDX module), but it's just a different namespace in the call number.



> Is there a unified Linux name these can be given to stop the
> proliferation of poor vendor names for similar concepts?

We could use protected_guest()


>
> Does it also not know how to handle #VE to keep it aligned with the
> runtime code?


Not sure I understand the question, but the decompression code supports 
neither alternatives nor #VE. It's a very limited environment.

>
> Outside the boot decompression code isn't this branch of the "ifdef
> BOOT_COMPRESSED_MISC_H"  handled by #VE? I also don't see any usage of
> __{in,out}() in this patch.

I thought it was all alternatives after decompression, so the #VE code 
shouldn't be called. We still have it for some reason, though.


>
> Perhaps "PAYLOAD_SIZE" since it is used for both input and output?
>
> If the ABI does not include the size of the payload then how would
> code detect if even 80 bytes was violated in the future?


The payload in memory is just a Linux concept. At the TDCALL level it's 
only registers.


>
> Surely there's an existing macro for this pattern? Would
> PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
> would eliminate clearing of %r8.


There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in 
some past refactorings.


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-10 23:08     ` Andi Kleen
@ 2021-05-10 23:34       ` Dan Williams
  2021-05-11  0:01         ` Andi Kleen
                           ` (2 more replies)
  0 siblings, 3 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-10 23:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Mon, May 10, 2021 at 4:08 PM Andi Kleen <ak@linux.intel.com> wrote:
>
>
> On 5/10/2021 2:57 PM, Dan Williams wrote:
> >
> > There is a mix of direct-TDVMCALL usage and handling #VE; when and why
> > is either approach used?
>
> For the really early code in the decompressor or the main kernel we
> can't use #VE because the IDT needed for handling the exception is not
> set up, and some other infrastructure needed by the handler is missing.
> The early code needs to do port IO to be able to write the early serial
> console. To keep it all common it ended up that all port IO is paravirt.
> Actually for most the main kernel port IO calls we could just use #VE
> and it would result in smaller binaries, but then we would need to
> annotate all early portio with some special name. That's why port IO is
> all TDCALL.

Thanks Andi. Sathya, please include the above in the next posting.

>
> For some others the only thing that really has to be #VE is MMIO because
> we don't want to annotate every MMIO read*/write* with an alternative
> (which would result in incredible binary bloat) For the others they have
> mostly become now direct calls.
>
>
> >
> >> Decompression code uses port IO for earlyprintk. We must use
> >> paravirt calls there too if we want to allow earlyprintk.
> > What is the tradeoff between teaching the decompression code to handle
> > #VE (the implied assumption) vs teaching it to avoid #VE with direct
> > TDVMCALLs (the chosen direction)?
>
> The decompression code only really needs it to output something. But you
> couldn't debug anything until #VE is set up. Also the decompression code
> has a very basic environment that doesn't supply most kernel services,
> and the #VE handler is relatively complicated. It would probably need to
> be duplicated and the instruction decoder be ported to work in this
> environment. It would be all a lot of work, just to make the debug
> output work.
>
> >
> >> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> >> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> >> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> >> ---
> >>   arch/x86/boot/compressed/Makefile |   1 +
> >>   arch/x86/boot/compressed/tdcall.S |   9 ++
> >>   arch/x86/include/asm/io.h         |   5 +-
> >>   arch/x86/include/asm/tdx.h        |  46 ++++++++-
> >>   arch/x86/kernel/tdcall.S          | 154 ++++++++++++++++++++++++++++++
> > Why is this named "tdcall" when it is implementing tdvmcalls? I must
> > say those names don't really help me understand what they do. Can we
> > have Linux names that don't mandate keeping the spec terminology in my
> > brain's translation cache?
>
> The instruction is called TDCALL. It's always the same instruction
>
> TDVMCALL is the variant when the host processes it (as opposed to the
> TDX module), but it's just a different name space in the call number.
>
>

Ok.

>
> > Is there a unified Linux name these can be given to stop the
> > proliferation of poor vendor names for similar concepts?
>
> We could use protected_guest()

Looks good.

>
>
> >
> > Does it also not know how to handle #VE to keep it aligned with the
> > runtime code?
>
>
> Not sure I understand the question, but the decompression code supports
> neither alternatives nor #VE. It's a very limited environment.

Yes, that addresses the question.

>
> >
> > Outside the boot decompression code isn't this branch of the "ifdef
> > BOOT_COMPRESSED_MISC_H"  handled by #VE? I also don't see any usage of
> > __{in,out}() in this patch.
>
> I thought it was all alternative after decompression, so the #VE code
> shouldn't be called. We still have it for some reason though.

Right, I'm struggling to understand where these spurious in/out
instructions that are not replaced by the alternatives code are coming
from. Shouldn't those be dropped on the floor and warned
about rather than handled? I.e. shouldn't port-IO instruction escapes
that would cause #VE be precluded at build time?

>
>
> >
> > Perhaps "PAYLOAD_SIZE" since it is used for both input and output?
> >
> > If the ABI does not include the size of the payload then how would
> > code detect if even 80 bytes was violated in the future?
>
>
> The payload in memory is just a Linux concept. At the TDCALL level it's
> only registers.
>

If it's only a Linux concept, why does this code need to "prepare for
the future"?


> >
> > Surely there's an existing macro for this pattern? Would
> > PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
> > would eliminate clearing of %r8.
>
>
> There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
> some past refactorings.

Not a huge deal, but at a minimum it seems a generic construct that
deserves to be declared centrally rather than tdx-guest-port-io local.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-10 23:34       ` Dan Williams
@ 2021-05-11  0:01         ` Andi Kleen
  2021-05-11  0:21           ` Dan Williams
  2021-05-11  0:30         ` Kuppuswamy, Sathyanarayanan
  2021-05-11  0:56         ` Kuppuswamy, Sathyanarayanan
  2 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-11  0:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Mon, May 10, 2021 at 04:34:34PM -0700, Dan Williams wrote:
> > > Outside the boot decompression code isn't this branch of the "ifdef
> > > BOOT_COMPRESSED_MISC_H"  handled by #VE? I also don't see any usage of
> > > __{in,out}() in this patch.
> >
> > I thought it was all alternative after decompression, so the #VE code
> > shouldn't be called. We still have it for some reason though.
> 
> Right, I'm struggling to understand where these spurious in/out
> instructions are coming from that are not replaced by the
> alternative's code?

There should be nothing in the main tree at least.

> Shouldn't those be dropped on the floor and warned
> about rather than handled? 

It might be related to eventually handling them in ring 3, but
I believe we disallow that currently too, and it's not all that useful
anyway. So yes, it could be forbidden.

> I.e. shouldn't port-io instruction escapes
> that would cause #VE be precluded at build-time?

You mean in objtool? That would seem like overkill for a mostly theoretical
problem.

> > There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
> > some past refactorings.
> 
> Not a huge deal, but at a minimum it seems a generic construct that
> deserves to be declared centrally rather than tdx-guest-port-io local.

Yes I agree. We should just bring SAVE_ALL/SAVE_REGS back.

-Andi

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11  0:01         ` Andi Kleen
@ 2021-05-11  0:21           ` Dan Williams
  0 siblings, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-11  0:21 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Mon, May 10, 2021 at 5:01 PM Andi Kleen <ak@lin
[..]
> > I.e. shouldn't port-io instruction escapes
> > that would cause #VE be precluded at build-time?
>
> You mean in objtool? That would seem like overkill for a more theoretical
> problem.

Oh, sorry, no, I was not implying objtool overkill, just that the
mainline kernel should not be surprised by spurious instruction usage.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-10 23:34       ` Dan Williams
  2021-05-11  0:01         ` Andi Kleen
@ 2021-05-11  0:30         ` Kuppuswamy, Sathyanarayanan
  2021-05-11  1:07           ` Dan Williams
  2021-05-11  0:56         ` Kuppuswamy, Sathyanarayanan
  2 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-11  0:30 UTC (permalink / raw)
  To: Dan Williams, Andi Kleen
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List



On 5/10/21 4:34 PM, Dan Williams wrote:
> On Mon, May 10, 2021 at 4:08 PM Andi Kleen <ak@linux.intel.com> wrote:
>>
>>
>> On 5/10/2021 2:57 PM, Dan Williams wrote:
>>>
>>> There is a mix of direct-TDVMCALL usage and handling #VE; when and why
>>> is either approach used?
>>
>> For the really early code in the decompressor or the main kernel we
>> can't use #VE because the IDT needed for handling the exception is not
>> set up, and some other infrastructure needed by the handler is missing.
>> The early code needs to do port IO to be able to write the early serial
>> console. To keep it all common it ended up that all port IO is paravirt.
>> Actually for most the main kernel port IO calls we could just use #VE
>> and it would result in smaller binaries, but then we would need to
>> annotate all early portio with some special name. That's why port IO is
>> all TDCALL.
> 
> Thanks Andi. Sathya, please include the above in the next posting.

Will include it.

> 
>>
>> For some others the only thing that really has to be #VE is MMIO because
>> we don't want to annotate every MMIO read*/write* with an alternative
>> (which would result in incredible binary bloat) For the others they have
>> mostly become now direct calls.
>>
>>
>>>
>>>> Decompression code uses port IO for earlyprintk. We must use
>>>> paravirt calls there too if we want to allow earlyprintk.
>>> What is the tradeoff between teaching the decompression code to handle
>>> #VE (the implied assumption) vs teaching it to avoid #VE with direct
>>> TDVMCALLs (the chosen direction)?
>>
>> The decompression code only really needs it to output something. But you
>> couldn't debug anything until #VE is set up. Also the decompression code
>> has a very basic environment that doesn't supply most kernel services,
>> and the #VE handler is relatively complicated. It would probably need to
>> be duplicated and the instruction decoder be ported to work in this
>> environment. It would be all a lot of work, just to make the debug
>> output work.
>>
>>>
>>>> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>>>> ---
>>>>    arch/x86/boot/compressed/Makefile |   1 +
>>>>    arch/x86/boot/compressed/tdcall.S |   9 ++
>>>>    arch/x86/include/asm/io.h         |   5 +-
>>>>    arch/x86/include/asm/tdx.h        |  46 ++++++++-
>>>>    arch/x86/kernel/tdcall.S          | 154 ++++++++++++++++++++++++++++++
>>> Why is this named "tdcall" when it is implementing tdvmcalls? I must
>>> say those names don't really help me understand what they do. Can we
>>> have Linux names that don't mandate keeping the spec terminology in my
>>> brain's translation cache?
>>
>> The instruction is called TDCALL. It's always the same instruction
>>
>> TDVMCALL is the variant when the host processes it (as opposed to the
>> TDX module), but it's just a different name space in the call number.
>>
>>
> 
> Ok.
> 
>>
>>> Is there a unified Linux name these can be given to stop the
>>> proliferation of poor vendor names for similar concepts?
>>
>> We could use protected_guest()
> 
> Looks good.
> 
>>
>>
>>>
>>> Does it also not know how to handle #VE to keep it aligned with the
>>> runtime code?
>>
>>
>> Not sure I understand the question, but the decompression code supports
>> neither alternatives nor #VE. It's a very limited environment.
> 
> Yes, that addresses the question.
> 
>>
>>>
>>> Outside the boot decompression code isn't this branch of the "ifdef
>>> BOOT_COMPRESSED_MISC_H"  handled by #VE? I also don't see any usage of
>>> __{in,out}() in this patch.
>>
>> I thought it was all alternative after decompression, so the #VE code
>> shouldn't be called. We still have it for some reason though.
> 
> Right, I'm struggling to understand where these spurious in/out
> instructions are coming from that are not replaced by the
> alternative's code? Shouldn't those be dropped on the floor and warned
> about rather than handled? I.e. shouldn't port-io instruction escapes
> that would cause #VE be precluded at build-time?
> 
>>
>>
>>>
>>> Perhaps "PAYLOAD_SIZE" since it is used for both input and output?
>>>
>>> If the ABI does not include the size of the payload then how would
>>> code detect if even 80 bytes was violated in the future?
>>
>>
>> The payload in memory is just a Linux concept. At the TDCALL level it's
>> only registers.
>>
> 
> If it's only a Linux concept why does this code need to "prepare for
> the future"?

It is a software-only structure, created to group all the output
registers used by the VMM. You can find more details about it in the patch
titled "[RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions".

It is mainly used by functions like __tdx_hypercall(), __tdx_hypercall_vendor_kvm()
and tdx_in{b,w,l}.

u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
                     struct tdx_hypercall_output *out);
u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
                                u64 r15, struct tdx_hypercall_output *out);

struct tdx_hypercall_output {
         u64 r11;
         u64 r12;
         u64 r13;
         u64 r14;
         u64 r15;
};


Functions like __tdx_hypercall() and __tdx_hypercall_vendor_kvm() are used
by the TDX guest to request services (like RDMSR, WRMSR, GetQuote, etc.) from
the VMM using the TDCALL instruction. do_tdx_hypercall() is the helper
function (in tdcall.S) which actually implements this ABI.

As per the current ABI, the VMM uses registers R11-R15 to share the output
values with the guest. So we have defined struct tdx_hypercall_output to
group all the output registers and make it easier to share them with users
of the TDCALLs. This is a Linux-defined structure.

If there are any changes in the TDCALL ABI for the VMM, we might have to
extend this structure to accommodate new output registers. So if we
define TDVMCALL_OUTPUT_SIZE as 40, we would have to modify this value for
any future struct tdx_hypercall_output changes. To avoid that, we have
allocated double the size.

Maybe I should define it as:

#define TDVMCALL_OUTPUT_SIZE            sizeof(struct tdx_hypercall_output)

But currently we don't include asm/tdx.h (which defines
struct tdx_hypercall_output) in tdcall.S, so I have defined the size as a
constant value.

> 
> 
>>>
>>> 5
>>> Surely there's an existing macro for this pattern? Would
>>> PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
>>> would eliminate clearing of %r8.
>>
>>
>> There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
>> some past refactorings.
> 
> Not a huge deal, but at a minimum it seems a generic construct that
> deserves to be declared centrally rather than tdx-guest-port-io local.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-10 23:34       ` Dan Williams
  2021-05-11  0:01         ` Andi Kleen
  2021-05-11  0:30         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-11  0:56         ` Kuppuswamy, Sathyanarayanan
  2021-05-11  2:19           ` Andi Kleen
  2 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-11  0:56 UTC (permalink / raw)
  To: Dan Williams, Andi Kleen
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List



On 5/10/21 4:34 PM, Dan Williams wrote:
>>> Surely there's an existing macro for this pattern? Would
>>> PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
>>> would eliminate clearing of %r8.
>>
>> There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
>> some past refactorings.
> Not a huge deal, but at a minimum it seems a generic construct that
> deserves to be declared centrally rather than tdx-guest-port-io local.

I can define SAVE_ALL_REGS/RESTORE_ALL_REGS. Do you want me to move them outside
the TDX code? I don't know whether there will be other users for them.



-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11  0:30         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-11  1:07           ` Dan Williams
  2021-05-11  2:29             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-11  1:07 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Andi Kleen, Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Mon, May 10, 2021 at 5:30 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
[..]
> It is mainly used by functions like __tdx_hypercall(),__tdx_hypercall_vendor_kvm()
> and tdx_in{b,w,l}.
>
> u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
>                      struct tdx_hypercall_output *out);
> u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
>                                 u64 r15, struct tdx_hypercall_output *out);
>
> struct tdx_hypercall_output {
>          u64 r11;
>          u64 r12;
>          u64 r13;
>          u64 r14;
>          u64 r15;
> };

Why is this by register name and not something like:

struct tdx_hypercall_payload {
  u64 data[5];
};

...because the code in this patch is reading the payload out of a
stack relative offset, not r11.

>
>
> Functions like __tdx_hypercall() and __tdx_hypercall_vendor_kvm() are used
> by TDX guest to request services (like RDMSR, WRMSR,GetQuote, etc) from VMM
> using TDCALL instruction. do_tdx_hypercall() is the helper function (in
> tdcall.S) which actually implements this ABI.
>
> As per current ABI, VMM will use registers R11-R15 to share the output
> values with the guest.

Which ABI, __tdx_hypercall_vendor_kvm()? The code is putting the
payload on the stack, so I'm not sure what ABI you are referring to?


> So we have defined the structure
> struct tdx_hypercall_output to group all output registers and make it easier
> to share it with users of the TDCALLs. This is Linux defined structure.
>
> If there are any changes in TDCALL ABI for VMM, we might have to extend
> this structure to accommodate new output register changes.  So if we
> define TDVMCALL_OUTPUT_SIZE as 40, we will have modify this value for
> any future struct tdx_hypercall_output changes. So to avoid it, we have
> allocated double the size.
>
> May be I should define it as,
>
> #define TDVMCALL_OUTPUT_SIZE            sizeof(struct tdx_hypercall_output)

An arrangement like that seems more reasonable than a seemingly
arbitrary number and an ominous warning about things that may happen
in the future.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-04-26 18:01 ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Kuppuswamy Sathyanarayanan
@ 2021-05-11  1:23   ` Dan Williams
  2021-05-11  2:17     ` Andi Kleen
  2021-05-11 14:08     ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Dave Hansen
  2021-05-11 15:53   ` Dave Hansen
  1 sibling, 2 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-11  1:23 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Mon, Apr 26, 2021 at 11:02 AM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> When running as a TDX guest, there are a number of existing,
> privileged instructions that do not work. If the guest kernel
> uses these instructions, the hardware generates a #VE.
>
> You can find the list of unsupported instructions in Intel
> Trust Domain Extensions (Intel® TDX) Module specification,
> sec 9.2.2 and in Guest-Host Communication Interface (GHCI)
> Specification for Intel TDX, sec 2.4.1.
>

Ah, better than the "handle port io" patch, these details at least
give the reader a chance.

> To prevent the TD guest from using MWAIT/MONITOR instructions,
> support for these instructions is already disabled by the TDX
> module (SEAM). So the CPUID flags for these instructions should
> be in the disabled state.

Why does this not result in a #UD if the instruction is disabled by
SEAM? How is it possible to execute a disabled instruction (one
precluded by CPUID) to the point where it triggers #VE instead of #UD?

> After the above mentioned preventive measures, if TD guests still
> execute these instructions, add appropriate warning messages in the #VE
> handler. For the WBINVD instruction, since it's related to memory writeback
> and cache flushes, it's mainly used in the context of I/O devices. Since
> TDX 1.0 does not support non-virtual I/O devices, skipping it should
> not cause any fatal issues.

WBINVD is in a different class than MWAIT/MONITOR since it is not
identified by CPUID, it can't possibly have the same #UD behaviour.
It's not clear why WBINVD is included in the same patch as
MWAIT/MONITOR?

I disagree with the assertion that WBINVD is mainly used in the
context of I/O devices, it's also used for ACPI power management
paths. WBINVD dependent functionality should be dynamically disabled
rather than warned about.

Does a TDX guest support out-of-tree modules?  The kernel is already
tainted when out-of-tree modules are loaded. In other words in-tree
modules preclude forbidden instructions because they can just be
audited, and out-of-tree modules are ok to trigger abrupt failure if
they attempt to use forbidden instructions.

> But to let users know about its usage, use
> WARN() to report it. For the MWAIT/MONITOR instructions, since they are
> unsupported, use WARN() to report the usage.

I'm not sure how useful warning is outside of a kernel developer's
debug environment. The kernel should know what instructions are
disabled and which are available. WBINVD in particular has potential
data integrity implications. Code that might lead to a WBINVD usage
should be disabled, not run all the way up to where WBINVD is
attempted and then trigger an after-the-fact WARN_ONCE().

The WBINVD change deserves to be split off from MWAIT/MONITOR, and
more thought needs to be put into where these spurious instruction
usages are arising.

>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/kernel/tdx.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 3fe617978fc4..294dda5bf3f6 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -371,6 +371,21 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>         case EXIT_REASON_EPT_VIOLATION:
>                 ve->instr_len = tdg_handle_mmio(regs, ve);
>                 break;
> +       case EXIT_REASON_WBINVD:
> +               /*
> +                * WBINVD is not supported inside TDX guests. All in-
> +                * kernel uses should have been disabled.
> +                */
> +               WARN_ONCE(1, "TD Guest used unsupported WBINVD instruction\n");
> +               break;
> +       case EXIT_REASON_MONITOR_INSTRUCTION:
> +       case EXIT_REASON_MWAIT_INSTRUCTION:
> +               /*
> +                * Something in the kernel used MONITOR or MWAIT despite
> +                * X86_FEATURE_MWAIT being cleared for TDX guests.
> +                */
> +               WARN_ONCE(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
> +               break;
>         default:
>                 pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>                 return -EFAULT;
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11  1:23   ` Dan Williams
@ 2021-05-11  2:17     ` Andi Kleen
  2021-05-11  2:44       ` Kuppuswamy, Sathyanarayanan
  2021-05-11 15:37       ` Dan Williams
  2021-05-11 14:08     ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Dave Hansen
  1 sibling, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-11  2:17 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

>> To prevent TD guest from using MWAIT/MONITOR instructions,
>> support for these instructions are already disabled by TDX
>> module (SEAM). So CPUID flags for these instructions should
>> be in disabled state.
> Why does this not result in a #UD if the instruction is disabled by
> SEAM?

It's just the TDX module (SEAM is the execution mode used by the TDX module).


> How is it possible to execute a disabled instruction (one
> precluded by CPUID) to the point where it triggers #VE instead of #UD?

That's how the TDX module works. It never injects anything other
than #VE. You can still get other exceptions of course, but they won't
come from the TDX module.

>> After the above mentioned preventive measures, if TD guests still
>> execute these instructions, add appropriate warning messages in #VE
>> handler. For the WBINVD instruction, since it's related to memory writeback
>> and cache flushes, it's mainly used in context of IO devices. Since
>> TDX 1.0 does not support non-virtual I/O devices, skipping it should
>> not cause any fatal issues.
> WBINVD is in a different class than MWAIT/MONITOR since it is not
> identified by CPUID, it can't possibly have the same #UD behaviour.
> It's not clear why WBINVD is included in the same patch as
> MWAIT/MONITOR?

These are all instructions we never expect to execute, so
nothing special is needed for them. That's a unique class that logically
fits together.


>
> I disagree with the assertion that WBINVD is mainly used in the
> context of I/O devices, it's also used for ACPI power management
> paths.

You mean S3? That's of course also not supported inside TDX.


>   WBINVD dependent functionality should be dynamically disabled
> rather than warned about.
>
> Does a TDX guest support out-of-tree modules?  The kernel is already
> tainted when out-of-tree modules are loaded. In other words in-tree
> modules preclude forbidden instructions because they can just be
> audited, and out-of-tree modules are ok to trigger abrupt failure if
> they attempt to use forbidden instructions.

We already did a lot of bi^wdiscussion on this on the last review.

Originally we had a different handling, this was the result of previous 
feedback.

It doesn't really matter because it should never happen.


>
>> But to let users know about its usage, use
>> WARN() to report about it.. For MWAIT/MONITOR instruction, since its
>> unsupported use WARN() to report unsupported usage.
> I'm not sure how useful warning is outside of a kernel developer's
> debug environment. The kernel should know what instructions are
> disabled and which are available. WBINVD in particular has potential
> data integrity implications. Code that might lead to a WBINVD usage
> should be disabled, not run all the way up to where WBINVD is
> attempted and then trigger an after-the-fact WARN_ONCE().

We don't expect the warning to ever happen. Yes all of this will be 
disabled. Nearly all are in code paths that cannot happen inside TDX 
anyways due to missing PCI-IDs or different cpuids, and S3 is explicitly 
disabled and would be impossible anyways due to lack of BIOS support.




>
> The WBINVD change deserves to be split off from MWAIT/MONITOR, and
> more thought needs to be put into where these spurious instruction
> usages are arising.

I disagree. We already spent a lot of cycles on this. WBINVD never
makes sense in current TDX, and all the code will be disabled.


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11  0:56         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-11  2:19           ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-11  2:19 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List


On 5/10/2021 5:56 PM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 5/10/21 4:34 PM, Dan Williams wrote:
>>>> Surely there's an existing macro for this pattern? Would
>>>> PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
>>>> would eliminate clearing of %r8.
>>>
>>> There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
>>> some past refactorings.
>> Not a huge deal, but at a minimum it seems a generic construct that
>> deserves to be declared centrally rather than tdx-guest-port-io local.
>
> I can define SAVE_ALL_REGS/RESTORE_ALL_REGS. Do you want to move it 
> outside
> TDX code? I don't know if there will be other users for it?

The old name was SAVE_ALL / SAVE_REGS.

Yes please put it outside tdx code into some include file.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11  1:07           ` Dan Williams
@ 2021-05-11  2:29             ` Kuppuswamy, Sathyanarayanan
  2021-05-11 14:39               ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-11  2:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andi Kleen, Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List



On 5/10/21 6:07 PM, Dan Williams wrote:
> On Mon, May 10, 2021 at 5:30 PM Kuppuswamy, Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> [..]
>> It is mainly used by functions like __tdx_hypercall(),__tdx_hypercall_vendor_kvm()
>> and tdx_in{b,w,l}.
>>
>> u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
>>                       struct tdx_hypercall_output *out);
>> u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
>>                                  u64 r15, struct tdx_hypercall_output *out);
>>
>> struct tdx_hypercall_output {
>>           u64 r11;
>>           u64 r12;
>>           u64 r13;
>>           u64 r14;
>>           u64 r15;
>> };
> 
> Why is this by register name and not something like:
> 
> struct tdx_hypercall_payload {
>    u64 data[5];
> };
> 
> ...because the code in this patch is reading the payload out of a
> stack relative offset, not r11.

Since this patch allocates this memory in ASM code, we read it via an
offset. If you look at other use cases in tdx.c, you will notice the use
of register names.

static void tdg_handle_cpuid(struct pt_regs *regs)
{
         u64 ret;
         struct tdx_hypercall_output out = {0};

         ret = __tdx_hypercall(EXIT_REASON_CPUID, regs->ax,
                               regs->cx, 0, 0, &out);

         WARN_ON(ret);

         regs->ax = out.r12;
         regs->bx = out.r13;
         regs->cx = out.r14;
         regs->dx = out.r15;
}

static u64 tdg_read_msr_safe(unsigned int msr, int *err)
{
         u64 ret;
         struct tdx_hypercall_output out = {0};

         WARN_ON_ONCE(tdg_is_context_switched_msr(msr));

         /*
          * Since the CSTAR MSR is not used by Intel CPUs for the SYSCALL
          * instruction, just ignore it. Even raising a TDVMCALL
          * will lead to the same result.
          */
         if (msr == MSR_CSTAR)
                 return 0;

         ret = __tdx_hypercall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out);

         *err = ret ? -EIO : 0;

         return out.r11;
}


> 
>>
>>
>> Functions like __tdx_hypercall() and __tdx_hypercall_vendor_kvm() are used
>> by TDX guest to request services (like RDMSR, WRMSR,GetQuote, etc) from VMM
>> using TDCALL instruction. do_tdx_hypercall() is the helper function (in
>> tdcall.S) which actually implements this ABI.
>>
>> As per current ABI, VMM will use registers R11-R15 to share the output
>> values with the guest.
> 
> Which ABI,

TDCALL ABI (see sections 3.1 to 3.12 and look for Output Operands in each TDVMCALL variant).

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

> __tdx_hypercall_vendor_kvm()? The code is putting the
> payload on the stack, so I'm not sure what ABI you are referring to?
> 
> 
>> So we have defined the structure
>> struct tdx_hypercall_output to group all output registers and make it easier
>> to share it with users of the TDCALLs. This is Linux defined structure.
>>
>> If there are any changes in TDCALL ABI for VMM, we might have to extend
>> this structure to accommodate new output register changes.  So if we
>> define TDVMCALL_OUTPUT_SIZE as 40, we will have modify this value for
>> any future struct tdx_hypercall_output changes. So to avoid it, we have
>> allocated double the size.
>>
>> May be I should define it as,
>>
>> #define TDVMCALL_OUTPUT_SIZE            sizeof(struct tdx_hypercall_output)
> 
> An arrangement like that seems more reasonable than a seemingly
> arbitrary number and an ominous warning about things that may happen
> in the future.

I will use the above format.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11  2:17     ` Andi Kleen
@ 2021-05-11  2:44       ` Kuppuswamy, Sathyanarayanan
  2021-05-11  2:51         ` Andi Kleen
  2021-05-11 15:37       ` Dan Williams
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-11  2:44 UTC (permalink / raw)
  To: Andi Kleen, Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List



On 5/10/21 7:17 PM, Andi Kleen wrote:
>>> To prevent TD guest from using MWAIT/MONITOR instructions,
>>> support for these instructions are already disabled by TDX
>>> module (SEAM). So CPUID flags for these instructions should
>>> be in disabled state.
>> Why does this not result in a #UD if the instruction is disabled by
>> SEAM?
> 
> It's just the TDX module (SEAM is the execution mode used by the TDX module)

If it is disabled by the TDX module, we should never execute it. But if, for
some reason, we still come across this instruction (buggy TDX module?), we add
an appropriate warning in the #VE handler.

> 
> 
>> How is it possible to execute a disabled instruction (one
>> precluded by CPUID) to the point where it triggers #VE instead of #UD?
> 
> That's how the TDX module works. It never injects anything else other than #VE. You can still get 
> other exceptions of course, but they won't come from the TDX module.
> 
>>> After the above mentioned preventive measures, if TD guests still
>>> execute these instructions, add appropriate warning messages in #VE
>>> handler. For the WBINVD instruction, since it's related to memory writeback
>>> and cache flushes, it's mainly used in context of IO devices. Since
>>> TDX 1.0 does not support non-virtual I/O devices, skipping it should
>>> not cause any fatal issues.
>> WBINVD is in a different class than MWAIT/MONITOR since it is not
>> identified by CPUID, it can't possibly have the same #UD behaviour.
>> It's not clear why WBINVD is included in the same patch as
>> MWAIT/MONITOR?
> 
> Because these are all instructions we never expect to execute, so nothing special is needed for 
> them. That's a unique class that logically fits together.

Yes, for all three of these instructions we don't need any special
handling code, so they are grouped together.

> 
> 
>>
>> I disagree with the assertion that WBINVD is mainly used in the
>> context of I/O devices, it's also used for ACPI power management
>> paths.
> 
> You mean S3? That's of course also not supported inside TDX.
> 
> 
>>   WBINVD dependent functionality should be dynamically disabled
>> rather than warned about.
>>
>> Does a TDX guest support out-of-tree modules?  The kernel is already
>> tainted when out-of-tree modules are loaded. In other words in-tree
>> modules preclude forbidden instructions because they can just be
>> audited, and out-of-tree modules are ok to trigger abrupt failure if
>> they attempt to use forbidden instructions.
> 
> We already did a lot of bi^wdiscussion on this on the last review.
> 
> Originally we had a different handling, this was the result of previous feedback.
> 
> It doesn't really matter because it should never happen.
> 
> 
>>
>>> But to let users know about its usage, use
>>> WARN() to report about it.. For MWAIT/MONITOR instruction, since its
>>> unsupported use WARN() to report unsupported usage.
>> I'm not sure how useful warning is outside of a kernel developer's
>> debug environment. The kernel should know what instructions are
>> disabled and which are available. WBINVD in particular has potential
>> data integrity implications. Code that might lead to a WBINVD usage
>> should be disabled, not run all the way up to where WBINVD is
>> attempted and then trigger an after-the-fact WARN_ONCE().
> 
> We don't expect the warning to ever happen. Yes all of this will be disabled. Nearly all are in code 
> paths that cannot happen inside TDX anyways due to missing PCI-IDs or different cpuids, and S3 is 
> explicitly disabled and would be impossible anyways due to lack of BIOS support.

We have added the WARN() to let users know about the usage so it can be fixed.
By default we should never hit this path.

> 
> 
> 
> 
>>
>> The WBINVD change deserves to be split off from MWAIT/MONITOR, and
>> more thought needs to be put into where these spurious instruction
>> usages are arising.
> 
> I disagree. We already spent a lot of cycles on this. WBINVD makes never sense in current TDX and 
> all the code will be disabled.

> 
> 
> -Andi
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11  2:44       ` Kuppuswamy, Sathyanarayanan
@ 2021-05-11  2:51         ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-11  2:51 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List


On 5/10/2021 7:44 PM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 5/10/21 7:17 PM, Andi Kleen wrote:
>>>> To prevent TD guest from using MWAIT/MONITOR instructions,
>>>> support for these instructions are already disabled by TDX
>>>> module (SEAM). So CPUID flags for these instructions should
>>>> be in disabled state.
>>> Why does this not result in a #UD if the instruction is disabled by
>>> SEAM?
>>
>> It's just the TDX module (SEAM is the execution mode used by the TDX 
>> module)
>
> If it is disabled by the TDX Module, we should never execute it. But 
> for some
> reason, if we still come across this instruction (buggy TDX module?), 
> we add
> appropriate warning in  #VE handler.

I think the only case where it could happen is if the kernel jumps to a 
random address due to a bug and the destination happens to be these 
instruction bytes. Of course it is exceedingly unlikely.

Or we make some mistake, but that's hopefully fixed quickly.


-Andi

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-10 22:52           ` Sean Christopherson
@ 2021-05-11  9:35             ` Borislav Petkov
  2021-05-20 20:12               ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-05-11  9:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Kuppuswamy, Sathyanarayanan, Andi Kleen,
	Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel

On Mon, May 10, 2021 at 10:52:49PM +0000, Sean Christopherson wrote:
> I can't find the thread offhand, but Boris proposed something along the lines of
> cpu_has(), but specific to a given flavor of protected guest.  IIRC, it was
> sev_guest_has(SEV_ES) or something like that.
> 
> I 100% agree that we should have actual feature bits somewhere for the various
> protected guest flavors.

Preach brother! :)

/me goes and greps mailboxes...

ah, do you mean this, per chance:

https://lore.kernel.org/kvm/20210421144402.GB5004@zn.tnic/

?

And yes, this has "sev" in the name and dhansen makes sense to me in
wishing to unify all the protected guest feature queries under a common
name. And then depending on the vendor, that common name will call the
respective vendor's helper to answer the protected guest aspect asked
about.

This way, generic code will call

	protected_guest_has()

or so and be nicely abstracted away from the underlying implementation.

Hohumm, yap, sounds nice to me.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11  1:23   ` Dan Williams
  2021-05-11  2:17     ` Andi Kleen
@ 2021-05-11 14:08     ` Dave Hansen
  2021-05-11 16:09       ` Sean Christopherson
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 14:08 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

On 5/10/21 6:23 PM, Dan Williams wrote:
>> To prevent TD guest from using MWAIT/MONITOR instructions,
>> support for these instructions are already disabled by TDX
>> module (SEAM). So CPUID flags for these instructions should
>> be in disabled state.
> Why does this not result in a #UD if the instruction is disabled by
> SEAM? How is it possible to execute a disabled instruction (one
> precluded by CPUID) to the point where it triggers #VE instead of #UD?

This is actually a vestige of VMX.  It's quite possible today to have a
feature which isn't enumerated in CPUID but still exists and "works"
in the silicon.  There are all kinds of pitfalls to doing this, but
folks evidently do it in public clouds all the time.

The CPUID virtualization basically just traps into the hypervisor and
lets the hypervisor set whatever register values it wants to appear when
CPUID "returns".

But, the controls for what instructions generate #UD are actually quite
separate and unrelated to CPUID itself.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11  2:29             ` Kuppuswamy, Sathyanarayanan
@ 2021-05-11 14:39               ` Dave Hansen
  2021-05-11 15:08                 ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 14:39 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Dan Williams
  Cc: Andi Kleen, Peter Zijlstra, Andy Lutomirski, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

On 5/10/21 7:29 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 5/10/21 6:07 PM, Dan Williams wrote:
>> On Mon, May 10, 2021 at 5:30 PM Kuppuswamy, Sathyanarayanan
>> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>> [..]
>>> It is mainly used by functions like
>>> __tdx_hypercall(),__tdx_hypercall_vendor_kvm()
>>> and tdx_in{b,w,l}.
>>>
>>> u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
>>>                       struct tdx_hypercall_output *out);
>>> u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
>>>                                  u64 r15, struct tdx_hypercall_output
>>> *out);
>>>
>>> struct tdx_hypercall_output {
>>>           u64 r11;
>>>           u64 r12;
>>>           u64 r13;
>>>           u64 r14;
>>>           u64 r15;
>>> };
>>
>> Why is this by register name and not something like:
>>
>> struct tdx_hypercall_payload {
>>    u64 data[5];
>> };
>>
>> ...because the code in this patch is reading the payload out of a
>> stack relative offset, not r11.
> 
> Since this patch allocates this memory in ASM code, we read it via
> offset. If you see other use cases in tdx.c, you will notice the use
> of register names.

To what do you refer by "this patch allocates this memory in ASM
code"?  Could you point to the specific ASM code that "allocates memory"?

Dan, I'll try to answer your question.  TDX has both a "hypercall"
interface for guests to call into hosts and a "seamcall" interface where
guests or hosts can talk to the TDX/SEAM module.

Both of these represent an ABI which _resembles_ a system call ABI.
Values are placed in registers, including a "function" register which is
very similar to the system call number we place in RAX.

*But* those ABIs were actually designed to (IIRC) resemble the
Windows/Microsoft ABI, not the Linux ABI.  So the register conventions
are unfamiliar.  There is assembly code to convert between the ELF
function call ABI and the TDX ABIs.

For instance, if you are in C code and you call:

	__tdx_hypercall_vendor_kvm(u64 fn, u64 r12, ...

The value for "fn" will be placed in RAX and "r12" will be placed in RDI
for the function call itself.  The assembly code will, for instance,
take the "r12" *VARIABLE* and ensure it gets into the R12 *REGISTER* for
the hypercall.

The same thing happens on the output side.  The TDX ABIs specify
"return" values in certain registers (r11-r15).  However, those
registers are not preserved in our function return ABI.  So, they must
be stashed off in memory into a place where the caller can retrieve them.

Rather than being unstructured "data[]", the value in
tdx_hypercall_output->r11 was actually in register R11 at some point.
If you look at the spec, you can see the functions that use R11.

Let's say there's a hypercall to check for whether puppies are cute.
Here's the kernel side:

bool tdx_hypercall_puppies_are_cute(void)
{
	struct tdx_hypercall_output out;
	u64 ret;

	ret = __tdx_hypercall_vendor_kvm(HOST_LIKES_PUPPIES, ..., &out);

	/* Did the hypercall even succeed? */
	if (ret != SUCCESS)
		return false;

	if (out.r11 & TDX_WHATEVER_CUTE_BIT)
		return true;

	/* Nope, I guess puppies are not cute */
	return false;
}

The spec would actually say, "Blah blah, puppies are cute if
TDX_WHATEVER_CUTE_BIT is set in r11".  So, this whole setup actually
results in really nice C code that you can sit side-by-side with the
spec and see if they agree.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11 14:39               ` Dave Hansen
@ 2021-05-11 15:08                 ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-11 15:08 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Andi Kleen, Peter Zijlstra, Andy Lutomirski, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List



On 5/11/21 7:39 AM, Dave Hansen wrote:
> To what do you refer by "this patch allocates this memory in ASM
> code"?  Could you point to the specific ASM code that "allocates memory"?

We use 40 bytes on the stack for storing the output register values. It is in
function tdg_inl().

subq $TDVMCALL_OUTPUT_SIZE, %rsp

+SYM_FUNC_START(tdg_inl)
+	io_save_registers
+	/* Set data width to 4 bytes */
+	mov $4, %rsi
+1:
+	mov %rdx, %rcx
+	/* Set 0 in RDX to select in operation */
+	mov $0, %rdx
+	/* Set TDVMCALL function id in RDI */
+	mov $EXIT_REASON_IO_INSTRUCTION, %rdi
+	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+	xor %r10, %r10
+	/* Allocate memory in stack for Output */
+	subq $TDVMCALL_OUTPUT_SIZE, %rsp
+	/* Move tdvmcall_output pointer to R9 */
+	movq %rsp, %r9
+
+	call do_tdvmcall

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-10 21:57   ` Dan Williams
  2021-05-10 23:08     ` Andi Kleen
@ 2021-05-11 15:35     ` Dave Hansen
  2021-05-11 15:43       ` Dan Williams
  2021-05-12  6:17       ` Dan Williams
  1 sibling, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 15:35 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

On 5/10/21 2:57 PM, Dan Williams wrote:
>> Decompression code uses port IO for earlyprintk. We must use
>> paravirt calls there too if we want to allow earlyprintk.
> What is the tradeoff between teaching the decompression code to handle
> #VE (the implied assumption) vs teaching it to avoid #VE with direct
> TDVMCALLs (the chosen direction)?

To me, the tradeoff is not just "teaching" the code to handle a #VE, but
ensuring that the entire architecture works.

Intentionally invoking a #VE is like making a function call that *MIGHT*
recurse on itself.  Sure, you can try to come up with a story about
bounding the recursion.  But, I don't see any semblance of that in this
series.

Exception-based recursion is really nasty because it's implicit, not
explicit.  That's why I'm advocating for a design where the kernel never
intentionally causes a #VE: it never intentionally recurses without bounds.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11  2:17     ` Andi Kleen
  2021-05-11  2:44       ` Kuppuswamy, Sathyanarayanan
@ 2021-05-11 15:37       ` Dan Williams
  2021-05-11 15:42         ` Andi Kleen
  2021-05-11 15:44         ` Dave Hansen
  1 sibling, 2 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-11 15:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Mon, May 10, 2021 at 7:17 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> >> To prevent TD guest from using MWAIT/MONITOR instructions,
> >> support for these instructions is already disabled by TDX
> >> module (SEAM). So CPUID flags for these instructions should
> >> be in disabled state.
> > Why does this not result in a #UD if the instruction is disabled by
> > SEAM?
>
> It's just the TDX module (SEAM is the execution mode used by the TDX module)
>
>
> > How is it possible to execute a disabled instruction (one
> > precluded by CPUID) to the point where it triggers #VE instead of #UD?
>
> That's how the TDX module works. It never injects anything else other
> than #VE. You can still get other exceptions of course, but they won't
> come from the TDX module.
>
> >> After the above mentioned preventive measures, if TD guests still
> >> execute these instructions, add appropriate warning messages in #VE
> >> handler. For WBIND instruction, since it's related to memory writeback
> >> and cache flushes, it's mainly used in context of IO devices. Since
> >> TDX 1.0 does not support non-virtual I/O devices, skipping it should
> >> not cause any fatal issues.
> > WBINVD is in a different class than MWAIT/MONITOR since it is not
> > identified by CPUID, it can't possibly have the same #UD behaviour.
> > It's not clear why WBINVD is included in the same patch as
> > MWAIT/MONITOR?
>
> Because these are all instructions we never expect to execute, so
> nothing special is needed for them. That's a unique class that logically
> fits together.
>
>
> >
> > I disagree with the assertion that WBINVD is mainly used in the
> > context of I/O devices, it's also used for ACPI power management
> > paths.
>
> You mean S3? That's of course also not supported inside TDX.
>
>
> >   WBINVD dependent functionality should be dynamically disabled
> > rather than warned about.
> >
> > Does a TDX guest support out-of-tree modules?  The kernel is already
> > tainted when out-of-tree modules are loaded. In other words in-tree
> > modules preclude forbidden instructions because they can just be
> > audited, and out-of-tree modules are ok to trigger abrupt failure if
> > they attempt to use forbidden instructions.
>
> We already did a lot of bi^wdiscussion on this on the last review.
>
> Originally we had a different handling, this was the result of previous
> feedback.
>
> It doesn't really matter because it should never happen.
>
>
> >
> >> But to let users know about its usage, use
> >> WARN() to report about it. For MWAIT/MONITOR instruction, since it's
> >> unsupported use WARN() to report unsupported usage.
> > I'm not sure how useful warning is outside of a kernel developer's
> > debug environment. The kernel should know what instructions are
> > disabled and which are available. WBINVD in particular has potential
> > data integrity implications. Code that might lead to a WBINVD usage
> > should be disabled, not run all the way up to where WBINVD is
> > attempted and then trigger an after-the-fact WARN_ONCE().
>
> We don't expect the warning to ever happen. Yes all of this will be
> disabled. Nearly all are in code paths that cannot happen inside TDX
> anyways due to missing PCI-IDs or different cpuids, and S3 is explicitly
> disabled and would be impossible anyways due to lack of BIOS support.
>
>
>
>
> >
> > The WBINVD change deserves to be split off from MWAIT/MONITOR, and
> > more thought needs to be put into where these spurious instruction
> > usages are arising.
>
> I disagree. We already spent a lot of cycles on this. WBINVD never makes
> sense in current TDX and all the code will be disabled.

Why not just drop the patch if it continues to cause people to spend
cycles on it and it addresses a problem that will never happen?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 15:37       ` Dan Williams
@ 2021-05-11 15:42         ` Andi Kleen
  2021-05-11 15:44         ` Dave Hansen
  1 sibling, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-11 15:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List


On 5/11/2021 8:37 AM, Dan Williams wrote:
> Why not just drop the patch if it continues to cause people to spend
> cycles on it and it addresses a problem that will never happen?

We want to at least get some kind of warning if there is really a 
mistake. Just dropping such an ability wouldn't seem right.

That's all that the patch does really.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11 15:35     ` Dave Hansen
@ 2021-05-11 15:43       ` Dan Williams
  2021-05-12  6:17       ` Dan Williams
  1 sibling, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-11 15:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Tue, May 11, 2021 at 8:36 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/10/21 2:57 PM, Dan Williams wrote:
> >> Decompression code uses port IO for earlyprintk. We must use
> >> paravirt calls there too if we want to allow earlyprintk.
> > What is the tradeoff between teaching the decompression code to handle
> > #VE (the implied assumption) vs teaching it to avoid #VE with direct
> > TDVMCALLs (the chosen direction)?
>
> To me, the tradeoff is not just "teaching" the code to handle a #VE, but
> ensuring that the entire architecture works.
>
> Intentionally invoking a #VE is like making a function call that *MIGHT*
> recurse on itself.  Sure, you can try to come up with a story about
> bounding the recursion.  But, I don't see any semblance of that in this
> series.
>
> Exception-based recursion is really nasty because it's implicit, not
> explicit.  That's why I'm advocating for a design where the kernel never
> intentionally causes a #VE: it never intentionally recurses without bounds.

Thanks Dave, this really helps.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 15:37       ` Dan Williams
  2021-05-11 15:42         ` Andi Kleen
@ 2021-05-11 15:44         ` Dave Hansen
  2021-05-11 15:50           ` Dan Williams
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 15:44 UTC (permalink / raw)
  To: Dan Williams, Andi Kleen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On 5/11/21 8:37 AM, Dan Williams wrote:
>> I disagree. We already spent a lot of cycles on this. WBINVD never makes
>> sense in current TDX and all the code will be disabled.
> Why not just drop the patch if it continues to cause people to spend
> cycles on it and it addresses a problem that will never happen?

If someone calls WBINVD, we have a bug.  Not a little bug, either.  It
probably means there's some horribly confused kernel code that's now
facing broken cache coherency.  To me, it's a textbook place to use
BUG_ON().

This also doesn't "address" the problem, it just helps produce a more
coherent warning message.  It's why we have OOPS messages in the page
fault handler: it never makes any sense to dereference a NULL pointer,
yet we have code to make debugging them easier.  It's well worth the ~20
lines of code that this costs us for ease of debugging.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 15:44         ` Dave Hansen
@ 2021-05-11 15:50           ` Dan Williams
  2021-05-11 15:52             ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-11 15:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Tue, May 11, 2021 at 8:45 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/11/21 8:37 AM, Dan Williams wrote:
> >> I disagree. We already spent a lot of cycles on this. WBINVD never makes
> >> sense in current TDX and all the code will be disabled.
> > Why not just drop the patch if it continues to cause people to spend
> > cycles on it and it addresses a problem that will never happen?
>
> If someone calls WBINVD, we have a bug.  Not a little bug, either.  It
> probably means there's some horribly confused kernel code that's now
> facing broken cache coherency.  To me, it's a textbook place to use
> BUG_ON().
>
> This also doesn't "address" the problem, it just helps produce a more
> coherent warning message.  It's why we have OOPS messages in the page
> fault handler: it never makes any sense to dereference a NULL pointer,
> yet we have code to make debugging them easier.  It's well worth the ~20
> lines of code that this costs us for ease of debugging.

The 'default' case in this 'switch' prints the exit reason and faults,
can't that also trigger a backtrace that dumps the exception stack and
the faulting instruction? In other words shouldn't this just fail with
a common way to provide better debug on any unhandled #VE and not try
to continue running past something that "can't" happen?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 15:50           ` Dan Williams
@ 2021-05-11 15:52             ` Andi Kleen
  2021-05-11 16:04               ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-11 15:52 UTC (permalink / raw)
  To: Dan Williams, Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List


> The 'default' case in this 'switch' prints the exit reason and faults,
> can't that also trigger a backtrace that dumps the exception stack and
> the faulting instruction? In other words shouldn't this just fail with
> a common way to provide better debug on any unhandled #VE and not try
> to continue running past something that "can't" happen?

It will use the #GP common code which will do all the backtracing etc.

We didn't think we would need anything else than what #GP already does.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-04-26 18:01 ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Kuppuswamy Sathyanarayanan
  2021-05-11  1:23   ` Dan Williams
@ 2021-05-11 15:53   ` Dave Hansen
  1 sibling, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 15:53 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> For WBIND instruction, since it's related to memory writeback

	^ WBINVD

> and cache flushes, it's mainly used in context of IO devices. Since
> TDX 1.0 does not support non-virtual I/O devices, skipping it should
> not cause any fatal issues. But

Do me a favor:

	grep -ri wbinvd arch/x86/

How many I/O devices do you see?

Please get your ducks in a row here.  Come up with a coherent changelog
about why the arch/x86 use of WBINVD doesn't apply to TDX guests.
Explain the audit that you did.  You *DID* do an audit, right?


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 15:52             ` Andi Kleen
@ 2021-05-11 16:04               ` Dave Hansen
  2021-05-11 17:06                 ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 16:04 UTC (permalink / raw)
  To: Andi Kleen, Dan Williams
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On 5/11/21 8:52 AM, Andi Kleen wrote:
>> The 'default' case in this 'switch' prints the exit reason and faults,
>> can't that also trigger a backtrace that dumps the exception stack and
>> the faulting instruction? In other words shouldn't this just fail with
>> a common way to provide better debug on any unhandled #VE and not try
>> to continue running past something that "can't" happen?
> 
> It will use the #GP common code which will do all the backtracing etc.
> 
> We didn't think we would need anything else than what #GP already does.

How do these end up in practice?  Do they still say "general protection
fault..."?

Isn't that really mean for anyone that goes trying to figure out what
caused these?  If they see a "general protection fault" from WBINVD and
go digging in the SDM for how a #GP can come from WBINVD, won't they be
sorely disappointed?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 14:08     ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Dave Hansen
@ 2021-05-11 16:09       ` Sean Christopherson
  2021-05-11 16:16         ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-11 16:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dan Williams, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Linux Kernel Mailing List

On Tue, May 11, 2021, Dave Hansen wrote:
> On 5/10/21 6:23 PM, Dan Williams wrote:
> >> To prevent TD guest from using MWAIT/MONITOR instructions,
> >> support for these instructions is already disabled by TDX
> >> module (SEAM). So CPUID flags for these instructions should
> >> be in disabled state.
> > Why does this not result in a #UD if the instruction is disabled by
> > SEAM? How is it possible to execute a disabled instruction (one
> > precluded by CPUID) to the point where it triggers #VE instead of #UD?
> 
This is actually a vestige of VMX.  It's quite possible today to have a
> feature which isn't enumerated in CPUID which still exists and "works"
> in the silicon.

No, virtualization holes are something else entirely.  

MONITOR/MWAIT are a bit weird; they do have an enable bit in IA32_MISC_ENABLE,
but most VMMs don't context switch IA32_MISC_ENABLE (load guest value on entry,
load host value on exit) because that would add ~250 cycles to every host<->guest
transition.  And IA32_MISC_ENABLE is shared between SMT siblings, which further
complicates loading the guest's value into hardware.  In the end, it's easier to
leave MONITOR/MWAIT enabled in hardware and instead force a VM-Exit.

As for why TDX injects #VE instead of #UD, I suspect it's for the same reason
that KVM emulates MONITOR/MWAIT as nops instead of injecting a #UD.  The CPUID
bit for MONITOR/MWAIT reflects their enabling in IA32_MISC_ENABLE, not raw
support in hardware.  That means there's no definitive way to enumerate to BIOS
that MONITOR/MWAIT are not supported, e.g. AFAICT, EDKII blindly assumes it can
enable MONITOR/MWAIT in IA32_MISC_ENABLE.  To justify #UD instead of #VE, TDX
would have to inject #GP on WRMSR to set IA32_MISC_ENABLE.ENABLE_MONITOR, and
even then there would be weirdness with respect to VMM behavior in response to
TDVMCALL(WRMSR) since the VMM could allow the virtual write.  In the end, it's
again simpler to inject #VE.

> There are all kinds of pitfalls to doing this, but folks evidently do it in
> public clouds all the time.

Virtualization holes are when instructions/features are enumerated via CPUID,
but don't have a control to hide the feature from the guest (or in the case of
CET, multiple feature are buried behind a single control).  So even if the VMM
hides the feature via CPUID, the guest can still _cleanly_ execute the
instruction if it's supported by the underlying hardware.

> The CPUID virtualization basically just traps into the hypervisor and
> lets the hypervisor set whatever register values it wants to appear when
> CPUID "returns".
> 
> But, the controls for what instructions generate #UD are actually quite
> separate and unrelated to CPUID itself.

Eh, any sane VMM will accurately represent its virtual CPU model via CPUID
insofar as possible, there are just too many creaky corners in x86 to make things
100% bombproof.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 16:09       ` Sean Christopherson
@ 2021-05-11 16:16         ` Dave Hansen
  0 siblings, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 16:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dan Williams, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Linux Kernel Mailing List

On 5/11/21 9:09 AM, Sean Christopherson wrote:
>>> Why does this not result in a #UD if the instruction is disabled by
>>> SEAM? How is it possible to execute a disabled instruction (one
>>> precluded by CPUID) to the point where it triggers #VE instead of #UD?
>> This is actually a vestige of VMX.  It's quite possible today to have a
>> feature which isn't enumerated in CPUID which still exists and "works"
>> in the silicon.
> No, virtualization holes are something else entirely.  

I think the bigger point is that *CPUID* doesn't enable or disable
instructions in and of itself.

It can *reflect* enabling (like OSPKE), but nothing is actually enabled
or disabled via CPUID.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 16:04               ` Dave Hansen
@ 2021-05-11 17:06                 ` Andi Kleen
  2021-05-11 17:42                   ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-11 17:06 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List


> We didn't think we would need anything else than what #GP already does.

> How do these end up in practice?  Do they still say "general protection
> fault..."?

Yes, but there's a #VE specific message before it that prints the exit 
reason.


>
> Isn't that really mean for anyone that goes trying to figure out what
> caused these?  If they see a "general protection fault" from WBINVD and
> go digging in the SDM for how a #GP can come from WBINVD, won't they be
> sorely disappointed?

They'll see both the message and also that it isn't a true #VE in the 
backtrace.


-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 17:06                 ` Andi Kleen
@ 2021-05-11 17:42                   ` Dave Hansen
  2021-05-11 17:48                     ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 17:42 UTC (permalink / raw)
  To: Andi Kleen, Dan Williams
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On 5/11/21 10:06 AM, Andi Kleen wrote:
>> How do these end up in practice?  Do they still say "general protection
>> fault..."?
> 
> Yes, but there's a #VE specific message before it that prints the exit
> reason.
> 
>> Isn't that really mean for anyone that goes trying to figure out what
>> caused these?  If they see a "general protection fault" from WBINVD and
>> go digging in the SDM for how a #GP can come from WBINVD, won't they be
>> sorely disappointed?
> 
> They'll see both the message and also that it isn't a true #VE in the
> backtrace.

Is there a good reason for the enduring "general protection fault..."
message other than an aversion to refactoring the code?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 17:42                   ` Dave Hansen
@ 2021-05-11 17:48                     ` Andi Kleen
  2021-05-24 23:32                       ` [RFC v2-fix-v2 1/2] x86/tdx: Handle MWAIT and MONITOR Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-11 17:48 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List


> Is there a good reason for the enduring "general protection fault..."
> message other than an aversion to refactoring the code?

You're the first ever to think it's a problem.

We're assuming that kernel developers are smart enough to understand this.

Please I implore everyone to move on from this patch. This is my last 
email on this topic.

-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11 15:35     ` Dave Hansen
  2021-05-11 15:43       ` Dan Williams
@ 2021-05-12  6:17       ` Dan Williams
  2021-05-27  4:23         ` [RFC v2-fix-v1 0/3] " Kuppuswamy Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-12  6:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Tue, May 11, 2021 at 8:36 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/10/21 2:57 PM, Dan Williams wrote:
> >> Decompression code uses port IO for earlyprintk. We must use
> >> paravirt calls there too if we want to allow earlyprintk.
> > What is the tradeoff between teaching the decompression code to handle
> > #VE (the implied assumption) vs teaching it to avoid #VE with direct
> > TDVMCALLs (the chosen direction)?
>
> To me, the tradeoff is not just "teaching" the code to handle a #VE, but
> ensuring that the entire architecture works.
>
> Intentionally invoking a #VE is like making a function call that *MIGHT*
> recurse on itself.  Sure, you can try to come up with a story about
> bounding the recursion.  But, I don't see any semblance of that in this
> series.
>
> Exception-based recursion is really nasty because it's implicit, not
> explicit.  That's why I'm advocating for a design where the kernel never
> intentionally causes a #VE: it never intentionally recurses without bounds.

So this circles back to the common problem with the
mwait/monitor/wbinvd patch and this one. "Can't happen" #VE conditions
should be fatal. I.e. have a nice clear message about why the kernel
failed and halt. All the uses of these #VE triggering instructions can
be eliminated ahead of time with auditing and people that load
unaudited out-of-tree modules that trigger #VE get to keep the pieces.
Said pieces will be described to them by the #VE triggered fail
message. This isn't like split lock disable where the code is
difficult to audit.

What am I missing?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-05-10 15:56         ` Juergen Gross
@ 2021-05-12 12:07           ` Kirill A. Shutemov
  2021-05-12 13:18           ` Peter Zijlstra
  1 sibling, 0 replies; 381+ messages in thread
From: Kirill A. Shutemov @ 2021-05-12 12:07 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Andi Kleen, Borislav Petkov, Kuppuswamy Sathyanarayanan,
	Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On Mon, May 10, 2021 at 05:56:05PM +0200, Juergen Gross wrote:
> On 10.05.21 17:52, Andi Kleen wrote:
> > > > > CONFIG_PARAVIRT_XL will be used by TDX that needs a couple of paravirt
> > > > > calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
> > > > > config would be a bloat for TDX.
> > > > 
> > > > Used how? Why is it bloat for TDX?
> > > 
> > > Is there any major downside to move the halt related pvops functions
> > > from CONFIG_PARAVIRT_XXL to CONFIG_PARAVIRT?
> > 
> > I think the main motivation is to get rid of all the page table related
> > hooks for modern configurations. These are the bulk of the annotations
> > and  cause bloat and worse code. Shadow page tables are really obscure
> > these days and very few people still need them and it's totally
> > reasonable to build even widely used distribution kernels without them.
> > In contrast, most of the other hooks are comparatively few and also on
> > comparatively slow paths, so don't really matter too much.
> > 
> > I think it would be ok to have a CONFIG_PARAVIRT that does not have page
> > table support, and a separate config option for those (that could be
> > eventually deprecated).
> > 
> > But that would break existing .configs for those shadow stack users,
> > that's why I think Kirill did it the other way around.
> 
> No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
> other hypervisor's guests, supporting basically the TLB flush operations
> and time related operations only. Adding the halt related operations to
> PARAVIRT wouldn't break anything.

Yeah, I think we can do this. It should be fine.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-05-08  0:59     ` Kuppuswamy, Sathyanarayanan
@ 2021-05-12 13:00       ` Kirill A. Shutemov
  2021-05-12 14:10         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Kirill A. Shutemov @ 2021-05-12 13:00 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel, Isaku Yamahata

On Fri, May 07, 2021 at 05:59:34PM -0700, Kuppuswamy, Sathyanarayanan wrote:
> 
> 
> On 5/7/21 2:46 PM, Dave Hansen wrote:
> > I know KVM does weird stuff.  But, this is*really*  weird.  Why are we
> > #including a .c file into another .c file?
> 
> I think Kirill implemented it this way to skip Makefile changes for it. I don't
> see any other KVM direct dependencies in tdx.c.
> 
> I will fix it in next version.

This has to be compiled only for TDX+KVM.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-10 22:23       ` Dave Hansen
@ 2021-05-12 13:08         ` Kirill A. Shutemov
  2021-05-12 15:44           ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kirill A. Shutemov @ 2021-05-12 13:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel

On Mon, May 10, 2021 at 03:23:29PM -0700, Dave Hansen wrote:
> On 5/10/21 3:19 PM, Kuppuswamy, Sathyanarayanan wrote:
> > On 5/7/21 2:54 PM, Dave Hansen wrote:
> >> This doesn't seem much like common code to me.  It seems like 100% SEV
> >> code.  Is this really where we want to move it?
> > 
> > Both the SEV and TDX code need to enable
> > CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED and define the
> > force_dma_unencrypted() function.
> > 
> > force_dma_unencrypted() is modified by patch titled "x86/tdx: Make DMA
> > pages shared" to add TDX guest specific support.
> > 
> > Since both the SEV and TDX code use it, it is moved to a common file.
> 
> That's not an excuse to have a bunch of AMD (or Intel) feature-specific
> code in a file named "common".  I'd make an attempt to keep them
> separate and then call into the two separate functions *from* the common
> function.

But why? What good does the additional level of indirection bring?

It's like saying arch/x86/kernel/cpu/common.c shouldn't have anything AMD
or Intel specific. If a function can cover both vendors, I don't see a
point in the additional complexity.

-- 
 Kirill A. Shutemov


* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-05-10 15:56         ` Juergen Gross
  2021-05-12 12:07           ` Kirill A. Shutemov
@ 2021-05-12 13:18           ` Peter Zijlstra
  2021-05-12 13:24             ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Peter Zijlstra @ 2021-05-12 13:18 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Andi Kleen, Borislav Petkov, Kuppuswamy Sathyanarayanan,
	Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel

On Mon, May 10, 2021 at 05:56:05PM +0200, Juergen Gross wrote:

> No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
> other hypervisor's guests, supporting basically the TLB flush operations
> and time related operations only. Adding the halt related operations to
> PARAVIRT wouldn't break anything.

Also, I don't think anything modern should actually ever hit any of the
HLT instructions; most everything should end up at an MWAIT.

Still, do we want to give arch_safe_halt() and halt() the
PVOP_ALT_VCALL0() treatment?


* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-05-12 13:18           ` Peter Zijlstra
@ 2021-05-12 13:24             ` Andi Kleen
  2021-05-12 13:51               ` Juergen Gross
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-12 13:24 UTC (permalink / raw)
  To: Peter Zijlstra, Juergen Gross
  Cc: Borislav Petkov, Kuppuswamy Sathyanarayanan, Andy Lutomirski,
	Dave Hansen, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel


On 5/12/2021 6:18 AM, Peter Zijlstra wrote:
> On Mon, May 10, 2021 at 05:56:05PM +0200, Juergen Gross wrote:
>
>> No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
>> other hypervisor's guests, supporting basically the TLB flush operations
>> and time related operations only. Adding the halt related operations to
>> PARAVIRT wouldn't break anything.
> Also, I don't think anything modern should actually ever hit any of the
> HLT instructions, most everything should end up at an MWAIT.
>
> Still, do we want to give arch_safe_halt() and halt() the
> PVOP_ALT_VCALL0() treatment?

For performance reasons it's pointless to patch. HLT (and MWAIT) are
so slow anyway that using patching or an indirect pointer is completely
in the noise. So I would use whatever is cleanest in the code.

-Andi





* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-05-12 13:24             ` Andi Kleen
@ 2021-05-12 13:51               ` Juergen Gross
  2021-05-17 23:50                 ` [RFC v2-fix 1/1] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Juergen Gross @ 2021-05-12 13:51 UTC (permalink / raw)
  To: Andi Kleen, Peter Zijlstra
  Cc: Borislav Petkov, Kuppuswamy Sathyanarayanan, Andy Lutomirski,
	Dave Hansen, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel



On 12.05.21 15:24, Andi Kleen wrote:
> 
> On 5/12/2021 6:18 AM, Peter Zijlstra wrote:
>> On Mon, May 10, 2021 at 05:56:05PM +0200, Juergen Gross wrote:
>>
>>> No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
>>> other hypervisor's guests, supporting basically the TLB flush operations
>>> and time related operations only. Adding the halt related operations to
>>> PARAVIRT wouldn't break anything.
>> Also, I don't think anything modern should actually ever hit any of the
>> HLT instructions, most everything should end up at an MWAIT.
>>
>> Still, do we want to give arch_safe_halt() and halt() the
>> PVOP_ALT_VCALL0() treatment?
> 
> For performance reasons it's pointless to patch. HLT (and MWAIT) are
> so slow anyway that using patching or an indirect pointer is completely
> in the noise. So I would use whatever is cleanest in the code.

This would probably be x86_platform_ops.hyper hooks.


Juergen
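
[Editor's note: the "indirect pointer instead of patching" shape being
discussed can be sketched in plain C. This is a generic mock, not the
actual x86_platform_ops layout; the hook struct, both halt
implementations, and the counters are stand-ins.]

```c
#include <assert.h>

static int native_halts, tdx_halts;

/* Stand-ins: the real implementations would be "sti; hlt" and a
 * TDVMCALL into the TDX module, respectively. */
static void native_safe_halt(void) { native_halts++; }
static void tdx_safe_halt(void)    { tdx_halts++; }

/* A plain function-pointer hook (hypothetical layout): since HLT
 * latency dwarfs an indirect call, no alternatives-patching of the
 * call site is needed, which is Andi's point above. */
static struct {
	void (*safe_halt)(void);
} x86_ops = { .safe_halt = native_safe_halt };

static void arch_safe_halt(void)
{
	x86_ops.safe_halt();
}
```

Guest-type init code would then just repoint the hook once at boot.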



* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-05-12 13:00       ` Kirill A. Shutemov
@ 2021-05-12 14:10         ` Kuppuswamy, Sathyanarayanan
  2021-05-12 14:29           ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-12 14:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel, Isaku Yamahata



On 5/12/21 6:00 AM, Kirill A. Shutemov wrote:
> This has to be compiled only for TDX+KVM.

Got it. So if we want to remove the "C" file include, we will have to
add an ifdef CONFIG_KVM_GUEST check in the Makefile:

ifdef CONFIG_KVM_GUEST
obj-$(CONFIG_INTEL_TDX_GUEST) += tdx-kvm.o
endif

Dave, do you prefer the above change over the "C" file include?

  25 #ifdef CONFIG_KVM_GUEST
  26 #include "tdx-kvm.c"
  27 #endif

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-05-12 14:10         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-12 14:29           ` Dave Hansen
  2021-05-13 19:29             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-12 14:29 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Kirill A. Shutemov
  Cc: Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata

On 5/12/21 7:10 AM, Kuppuswamy, Sathyanarayanan wrote:
> On 5/12/21 6:00 AM, Kirill A. Shutemov wrote:
>> This has to be compiled only for TDX+KVM.
> 
> Got it. So if we want to remove the "C" file include, we will have to
> add an ifdef CONFIG_KVM_GUEST check in the Makefile:
> 
> ifdef CONFIG_KVM_GUEST
> obj-$(CONFIG_INTEL_TDX_GUEST) += tdx-kvm.o
> endif

Is there truly no dependency between CONFIG_KVM_GUEST and
CONFIG_INTEL_TDX_GUEST?

If there isn't, then the way we do it is adding another (invisible)
Kconfig variable to express the dependency for tdx-kvm.o:

config INTEL_TDX_GUEST_KVM
	def_bool y
	depends on KVM_GUEST && INTEL_TDX_GUEST
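
[Editor's note: with such an invisible symbol the Makefile side then
needs no ifdef at all. A sketch, where INTEL_TDX_GUEST_KVM is the
hypothetical symbol from above:]

```make
# arch/x86/kernel/Makefile (sketch): the TDX+KVM dependency lives in
# Kconfig, so the Makefile stays a plain one-liner.
obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o
```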


* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-12 13:08         ` Kirill A. Shutemov
@ 2021-05-12 15:44           ` Dave Hansen
  2021-05-12 15:53             ` Sean Christopherson
  2021-05-18  1:28             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-12 15:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel

On 5/12/21 6:08 AM, Kirill A. Shutemov wrote:
>> That's not an excuse to have a bunch of AMD (or Intel) feature-specific
>> code in a file named "common".  I'd make an attempt to keep them
>> separate and then call into the two separate functions *from* the common
>> function.
> But why? What good does the additional level of indirection bring?
> 
> It's like saying arch/x86/kernel/cpu/common.c shouldn't have anything AMD
> or Intel specific. If a function can cover both vendors, I don't see a
> point in the additional complexity.

Because the code is already separate.  You're actually going to some
trouble to move the SEV-specific code and then combine it with the
TDX-specific code.

Anyway, please just give it a shot.  Should take all of ten minutes.  If
it doesn't work out in practice, fine.  You'll have a good paragraph for
the changelog.


* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-12 15:44           ` Dave Hansen
@ 2021-05-12 15:53             ` Sean Christopherson
  2021-05-13 16:40               ` Kuppuswamy, Sathyanarayanan
  2021-05-18  1:28             ` Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-12 15:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Kuppuswamy, Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel

On Wed, May 12, 2021, Dave Hansen wrote:
> On 5/12/21 6:08 AM, Kirill A. Shutemov wrote:
> >> That's not an excuse to have a bunch of AMD (or Intel) feature-specific
> >> code in a file named "common".  I'd make an attempt to keep them
> >> separate and then call into the two separate functions *from* the common
> >> function.
> > But why? What good does the additional level of indirection bring?
> > 
> > It's like saying arch/x86/kernel/cpu/common.c shouldn't have anything AMD
> > or Intel specific. If a function can cover both vendors, I don't see a
> > point in the additional complexity.
> 
> Because the code is already separate.  You're actually going to some
> trouble to move the SEV-specific code and then combine it with the
> TDX-specific code.
> 
> Anyway, please just give it a shot.  Should take all of ten minutes.  If
> it doesn't work out in practice, fine.  You'll have a good paragraph for
> the changelog.

Or maybe wait to see how Boris' proposed protected_guest_has() pans out?  E.g. if
we can do "protected_guest_has(MEMORY_ENCRYPTION)" or whatever, then the truly
common bits could be placed into common.c without any vendor-specific logic.


* Re: [RFC v2 21/32] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-04-26 18:01 ` [RFC v2 21/32] x86/boot: Add a trampoline for APs booting in 64-bit mode Kuppuswamy Sathyanarayanan
@ 2021-05-13  2:56   ` Dan Williams
  2021-05-18  0:54     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-13  2:56 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Sean Christopherson, Kai Huang

On Mon, Apr 26, 2021 at 11:03 AM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Add a trampoline for booting APs in 64-bit mode via a software handoff
> with BIOS, and use the new trampoline for the ACPI MP wake protocol used
> by TDX.

Let's add a spec reference:

See section "4.1 ACPI-MADT-AP-Wakeup Table" in the Guest-Host
Communication Interface specification for TDX.

Although there is not much "wake protocol" in this patch, this
appears to be the end of the process after the CPU has been messaged
to start.

> Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
> mode.  For the GDT pointer, create a new entry as the existing storage
> for the pointer occupies the zero entry in the GDT itself.
>
> Reported-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/include/asm/realmode.h          |  1 +
>  arch/x86/kernel/smpboot.c                |  5 +++
>  arch/x86/realmode/rm/header.S            |  1 +
>  arch/x86/realmode/rm/trampoline_64.S     | 49 +++++++++++++++++++++++-
>  arch/x86/realmode/rm/trampoline_common.S |  5 ++-
>  5 files changed, 58 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
> index 5db5d083c873..5066c8b35e7c 100644
> --- a/arch/x86/include/asm/realmode.h
> +++ b/arch/x86/include/asm/realmode.h
> @@ -25,6 +25,7 @@ struct real_mode_header {
>         u32     sev_es_trampoline_start;
>  #endif
>  #ifdef CONFIG_X86_64
> +       u32     trampoline_start64;
>         u32     trampoline_pgd;
>  #endif
>         /* ACPI S3 wakeup */
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 16703c35a944..27d8491d753a 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1036,6 +1036,11 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
>         unsigned long boot_error = 0;
>         unsigned long timeout;
>
> +#ifdef CONFIG_X86_64
> +       if (is_tdx_guest())
> +               start_ip = real_mode_header->trampoline_start64;
> +#endif

Perhaps wrap this into an inline helper in
arch/x86/include/asm/realmode.h so that this routine only does one
assignment to @start_ip at function entry?
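
[Editor's note: something like the following, perhaps. This is a
userspace-testable sketch; the struct fields mirror the patch, but the
field values and the is_tdx_guest() stub are made up, and in the kernel
the TDX branch would sit under CONFIG_X86_64.]

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel definitions; the addresses are invented. */
struct real_mode_header {
	unsigned int trampoline_start;
	unsigned int trampoline_start64;
};

static struct real_mode_header rmh = {
	.trampoline_start   = 0x9000,
	.trampoline_start64 = 0x9100,
};
static struct real_mode_header *real_mode_header = &rmh;

static bool tdx_guest;
static bool is_tdx_guest(void) { return tdx_guest; }

/* The suggested helper: pick the AP start IP in one place, so that
 * do_boot_cpu() does a single assignment at function entry. */
static unsigned long real_mode_entry_point(void)
{
	if (is_tdx_guest())
		return real_mode_header->trampoline_start64;
	return real_mode_header->trampoline_start;
}
```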

> +
>         idle->thread.sp = (unsigned long)task_pt_regs(idle);
>         early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
>         initial_code = (unsigned long)start_secondary;
> diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
> index 8c1db5bf5d78..2eb62be6d256 100644
> --- a/arch/x86/realmode/rm/header.S
> +++ b/arch/x86/realmode/rm/header.S
> @@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
>         .long   pa_sev_es_trampoline_start
>  #endif
>  #ifdef CONFIG_X86_64
> +       .long   pa_trampoline_start64
>         .long   pa_trampoline_pgd;
>  #endif
>         /* ACPI S3 wakeup */
> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
> index 84c5d1b33d10..12b734b1da8b 100644
> --- a/arch/x86/realmode/rm/trampoline_64.S
> +++ b/arch/x86/realmode/rm/trampoline_64.S
> @@ -143,13 +143,20 @@ SYM_CODE_START(startup_32)
>         movl    %eax, %cr3
>
>         # Set up EFER
> +       movl    $MSR_EFER, %ecx
> +       rdmsr
> +       cmp     pa_tr_efer, %eax
> +       jne     .Lwrite_efer
> +       cmp     pa_tr_efer + 4, %edx
> +       je      .Ldone_efer
> +.Lwrite_efer:
>         movl    pa_tr_efer, %eax
>         movl    pa_tr_efer + 4, %edx
> -       movl    $MSR_EFER, %ecx
>         wrmsr

Is this hunk just a performance optimization to save an unnecessary
wrmsr when it is pre-populated with the right value? Is it required
for this patch? If "yes", that was not clear to me from the changelog;
if "no", it seems like it belongs in a standalone optimization patch.

>
> +.Ldone_efer:
>         # Enable paging and in turn activate Long Mode
> -       movl    $(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
> +       movl    $(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax

It seems setting X86_CR0_NE is redundant when coming through
pa_trampoline_compat; is this a standalone fix to make sure that
'numeric-error' is enabled before startup_64?

>         movl    %eax, %cr0
>
>         /*
> @@ -161,6 +168,19 @@ SYM_CODE_START(startup_32)
>         ljmpl   $__KERNEL_CS, $pa_startup_64
>  SYM_CODE_END(startup_32)
>
> +SYM_CODE_START(pa_trampoline_compat)
> +       /*
> +        * In compatibility mode.  Prep ESP and DX for startup_32, then disable
> +        * paging and complete the switch to legacy 32-bit mode.
> +        */
> +       movl    $rm_stack_end, %esp
> +       movw    $__KERNEL_DS, %dx
> +
> +       movl    $(X86_CR0_NE | X86_CR0_PE), %eax
> +       movl    %eax, %cr0
> +       ljmpl   $__KERNEL32_CS, $pa_startup_32
> +SYM_CODE_END(pa_trampoline_compat)
> +
>         .section ".text64","ax"
>         .code64
>         .balign 4
> @@ -169,6 +189,20 @@ SYM_CODE_START(startup_64)
>         jmpq    *tr_start(%rip)
>  SYM_CODE_END(startup_64)
>
> +SYM_CODE_START(trampoline_start64)
> +       /*
> +        * APs start here on a direct transfer from 64-bit BIOS with identity
> +        * mapped page tables.  Load the kernel's GDT in order to gear down to
> +        * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
> +        * segment registers.  Load the zero IDT so any fault triggers a
> +        * shutdown instead of jumping back into BIOS.
> +        */
> +       lidt    tr_idt(%rip)
> +       lgdt    tr_gdt64(%rip)
> +
> +       ljmpl   *tr_compat(%rip)
> +SYM_CODE_END(trampoline_start64)
> +
>         .section ".rodata","a"
>         # Duplicate the global descriptor table
>         # so the kernel can live anywhere
> @@ -182,6 +216,17 @@ SYM_DATA_START(tr_gdt)
>         .quad   0x00cf93000000ffff      # __KERNEL_DS
>  SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
>
> +SYM_DATA_START(tr_gdt64)
> +       .short  tr_gdt_end - tr_gdt - 1 # gdt limit
> +       .long   pa_tr_gdt
> +       .long   0
> +SYM_DATA_END(tr_gdt64)
> +
> +SYM_DATA_START(tr_compat)
> +       .long   pa_trampoline_compat
> +       .short  __KERNEL32_CS
> +SYM_DATA_END(tr_compat)
> +
>         .bss
>         .balign PAGE_SIZE
>  SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
> diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
> index 5033e640f957..506d5897112a 100644
> --- a/arch/x86/realmode/rm/trampoline_common.S
> +++ b/arch/x86/realmode/rm/trampoline_common.S
> @@ -1,4 +1,7 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>         .section ".rodata","a"
>         .balign 16
> -SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
> +SYM_DATA_START_LOCAL(tr_idt)
> +       .short  0
> +       .quad   0
> +SYM_DATA_END(tr_idt)

Curious, is the following not equivalent?

-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+SYM_DATA_LOCAL(tr_idt, .fill 1, 10, 0)


* Re: [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot for TDX platforms
  2021-04-26 18:01 ` [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot for TDX platforms Kuppuswamy Sathyanarayanan
@ 2021-05-13  3:03   ` Dan Williams
  0 siblings, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-13  3:03 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Sean Christopherson

On Mon, Apr 26, 2021 at 11:03 AM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Avoid operations which will inject #VE during compressed
> boot, which is obviously fatal for TDX platforms.
>
> Details are,
>
>  1. TDX module injects #VE if a TDX guest attempts to write
>     EFER. So skip the WRMSR to set EFER.LME=1 if it's already
>     set. TDX also forces EFER.LME=1, i.e. the branch will always
>     be taken and thus the #VE avoided.

Ah, here's the justification for that hunk in the previous patch. Are
you sure that hunk belongs in the trampoline patch?

>
>  2. TDX module also injects a #VE if the guest attempts to clear
>     CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed
>     boot. Setting CR0.NE should be a nop on all CPUs that
>     support 64-bit mode.

Ah, here's the justification for CR0.NE in the previous patch. Did
something go wrong in the patch splitting?

>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/boot/compressed/head_64.S | 5 +++--
>  arch/x86/boot/compressed/pgtable.h | 2 +-
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index e94874f4bbc1..37c2f37d4a0d 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -616,8 +616,9 @@ SYM_CODE_START(trampoline_32bit_src)
>         movl    $MSR_EFER, %ecx
>         rdmsr
>         btsl    $_EFER_LME, %eax
> +       jc      1f
>         wrmsr
> -       popl    %edx
> +1:     popl    %edx
>         popl    %ecx
>
>         /* Enable PAE and LA57 (if required) paging modes */
> @@ -636,7 +637,7 @@ SYM_CODE_START(trampoline_32bit_src)
>         pushl   %eax
>
>         /* Enable paging again */
> -       movl    $(X86_CR0_PG | X86_CR0_PE), %eax
> +       movl    $(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
>         movl    %eax, %cr0
>
>         lret
> diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
> index 6ff7e81b5628..cc9b2529a086 100644
> --- a/arch/x86/boot/compressed/pgtable.h
> +++ b/arch/x86/boot/compressed/pgtable.h
> @@ -6,7 +6,7 @@
>  #define TRAMPOLINE_32BIT_PGTABLE_OFFSET        0
>
>  #define TRAMPOLINE_32BIT_CODE_OFFSET   PAGE_SIZE
> -#define TRAMPOLINE_32BIT_CODE_SIZE     0x70
> +#define TRAMPOLINE_32BIT_CODE_SIZE     0x80
>
>  #define TRAMPOLINE_32BIT_STACK_END     TRAMPOLINE_32BIT_SIZE
>
> --
> 2.25.1
>


* Re: [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process
  2021-04-26 18:01 ` [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process Kuppuswamy Sathyanarayanan
@ 2021-05-13  3:23   ` Dan Williams
  2021-05-18  0:59     ` [RFC v2-fix 1/1] x86/boot: Avoid #VE during boot for TDX platforms Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-13  3:23 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Mon, Apr 26, 2021 at 11:03 AM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Skip writing EFER during secondary_startup_64() if the current value is
> also the desired value. This avoids a #VE when running as a TDX guest,
> as the TDX-Module does not allow writes to EFER (even when writing the
> current, fixed value).
>
> Also, preserve CR4.MCE instead of clearing it during boot to avoid a #VE
> when running as a TDX guest. The TDX-Module (effectively part of the
> hypervisor) requires CR4.MCE to be set at all times and injects a #VE
> if the guest attempts to clear CR4.MCE.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/boot/compressed/head_64.S |  5 ++++-
>  arch/x86/kernel/head_64.S          | 13 +++++++++++--
>  2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index 37c2f37d4a0d..2d79e5f97360 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -622,7 +622,10 @@ SYM_CODE_START(trampoline_32bit_src)
>         popl    %ecx
>
>         /* Enable PAE and LA57 (if required) paging modes */
> -       movl    $X86_CR4_PAE, %eax
> +       movl    %cr4, %eax
> +       /* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
> +       andl    $X86_CR4_MCE, %eax
> +       orl     $X86_CR4_PAE, %eax
>         testl   %edx, %edx
>         jz      1f
>         orl     $X86_CR4_LA57, %eax
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 04bddaaba8e2..92c77cf75542 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
>  1:
>
>         /* Enable PAE mode, PGE and LA57 */
> -       movl    $(X86_CR4_PAE | X86_CR4_PGE), %ecx
> +       movq    %cr4, %rcx
> +       /* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
> +       andl    $X86_CR4_MCE, %ecx
> +       orl     $(X86_CR4_PAE | X86_CR4_PGE), %ecx
>  #ifdef CONFIG_X86_5LEVEL
>         testl   $1, __pgtable_l5_enabled(%rip)
>         jz      1f
> @@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
>         /* Setup EFER (Extended Feature Enable Register) */
>         movl    $MSR_EFER, %ecx
>         rdmsr
> +       movl    %eax, %edx

Maybe comment that EFER is being saved here to check if the following
enables are nops, but not a big deal.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

...modulo whether the EFER wrmsr avoidance in PATCH 21 should move here.

>         btsl    $_EFER_SCE, %eax        /* Enable System Call */
>         btl     $20,%edi                /* No Execute supported? */
>         jnc     1f
>         btsl    $_EFER_NX, %eax
>         btsq    $_PAGE_BIT_NX,early_pmd_flags(%rip)
> -1:     wrmsr                           /* Make changes effective */
>
> +       /* Skip the WRMSR if the current value matches the desired value. */
> +1:     cmpl    %edx, %eax
> +       je      1f
> +       xor     %edx, %edx
> +       wrmsr                           /* Make changes effective */
> +1:
>         /* Setup cr0 */
>         movl    $CR0_STATE, %eax
>         /* Make changes effective */
> --
> 2.25.1
>
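
[Editor's note: the two #VE-avoidance patterns in this patch reduce to
"preserve the forced bit" and "skip the redundant write". A userspace
sketch; the CR4 bit positions follow the SDM, the EFER value is made
up.]

```c
#include <assert.h>

#define X86_CR4_PAE (1UL << 5)
#define X86_CR4_MCE (1UL << 6)
#define X86_CR4_PGE (1UL << 7)

/* Pattern 1: preserve CR4.MCE (forced to 1 by the TDX module) while
 * setting the paging bits, instead of loading an absolute value. */
static unsigned long new_cr4(unsigned long old_cr4)
{
	return (old_cr4 & X86_CR4_MCE) | X86_CR4_PAE | X86_CR4_PGE;
}

/* Pattern 2: skip the WRMSR entirely when EFER already holds the
 * desired value, since the write itself would inject a #VE. */
static unsigned long long efer;
static int efer_writes;	/* each write is a potential #VE in a TD */

static void write_efer(unsigned long long want)
{
	if (efer == want)
		return;
	efer = want;
	efer_writes++;
}
```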


* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-12 15:53             ` Sean Christopherson
@ 2021-05-13 16:40               ` Kuppuswamy, Sathyanarayanan
  2021-05-13 17:49                 ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-13 16:40 UTC (permalink / raw)
  To: Sean Christopherson, Dave Hansen
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel



On 5/12/21 8:53 AM, Sean Christopherson wrote:
> On Wed, May 12, 2021, Dave Hansen wrote:
>> On 5/12/21 6:08 AM, Kirill A. Shutemov wrote:
>>>> That's not an excuse to have a bunch of AMD (or Intel) feature-specific
>>>> code in a file named "common".  I'd make an attempt to keep them
>>>> separate and then call into the two separate functions *from* the common
>>>> function.
>>> But why? What good does the additional level of indirection bring?
>>>
>>> It's like saying arch/x86/kernel/cpu/common.c shouldn't have anything AMD
>>> or Intel specific. If a function can cover both vendors, I don't see a
>>> point in the additional complexity.
>>
>> Because the code is already separate.  You're actually going to some
>> trouble to move the SEV-specific code and then combine it with the
>> TDX-specific code.
>>
>> Anyway, please just give it a shot.  Should take all of ten minutes.  If
>> it doesn't work out in practice, fine.  You'll have a good paragraph for
>> the changelog.
> 
> Or maybe wait to see how Boris' proposed protected_guest_has() pans out?  E.g. if
> we can do "protected_guest_has(MEMORY_ENCRYPTION)" or whatever, then the truly
> common bits could be placed into common.c without any vendor-specific logic.

How about the following abstraction? This patch was initially created to let us
use is_tdx_guest() outside of arch/x86 code, but I extended it to support bitmap flags.

commit 188bdd3c97e49020b2bda9efd992a22091423b85
Author: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Date:   Wed May 12 11:35:13 2021 -0700

     tdx: Introduce generic protected_guest abstraction

     Add a generic way to check if we run with an encrypted guest,
     without requiring x86-specific ifdefs. This can then be used in
     non-architecture-specific code. Enable this when running under
     TDX/SEV.

     Also add helper functions to set/test encrypted guest feature
     flags.

     Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

diff --git a/arch/Kconfig b/arch/Kconfig
index ecfd3520b676..98c30312555b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -956,6 +956,9 @@ config HAVE_ARCH_NVRAM_OPS
  config ISA_BUS_API
  	def_bool ISA

+config ARCH_HAS_PROTECTED_GUEST
+	bool
+
  #
  # ABI hall of shame
  #
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 07fb4df1d881..001487c21874 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -882,6 +882,7 @@ config INTEL_TDX_GUEST
  	select PARAVIRT_XL
  	select X86_X2APIC
  	select SECURITY_LOCKDOWN_LSM
+	select ARCH_HAS_PROTECTED_GUEST
  	help
  	  Provide support for running in a trusted domain on Intel processors
  	  equipped with Trusted Domain eXtenstions. TDX is a new Intel
@@ -1537,6 +1538,7 @@ config AMD_MEM_ENCRYPT
  	select ARCH_USE_MEMREMAP_PROT
  	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
  	select INSTRUCTION_DECODER
+	select ARCH_HAS_PROTECTED_GUEST
  	help
  	  Say yes to enable support for the encryption of system memory.
  	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ccab6cf91283..8260893c34ae 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -21,6 +21,7 @@
  #include <linux/usb/xhci-dbgp.h>
  #include <linux/static_call.h>
  #include <linux/swiotlb.h>
+#include <linux/protected_guest.h>

  #include <uapi/linux/mount.h>

@@ -107,6 +108,10 @@ static struct resource bss_resource = {
  	.flags	= IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
  };

+#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
+DECLARE_BITMAP(protected_guest_flags, PROTECTED_GUEST_BITMAP_LEN);
+EXPORT_SYMBOL(protected_guest_flags);
+#endif

  #ifdef CONFIG_X86_32
  /* CPU data as detected by the assembly code in head_32.S */
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 04a780abb512..45b848ec8325 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -19,6 +19,7 @@
  #include <linux/memblock.h>
  #include <linux/kernel.h>
  #include <linux/mm.h>
+#include <linux/protected_guest.h>

  #include <asm/cpu_entry_area.h>
  #include <asm/stacktrace.h>
@@ -680,6 +681,9 @@ static void __init init_ghcb(int cpu)

  	data->ghcb_active = false;
  	data->backup_ghcb_active = false;
+
+	set_protected_guest_flag(GUEST_TYPE_SEV);
+	set_protected_guest_flag(MEMORY_ENCRYPTION);
  }

  void __init sev_es_init_vc_handling(void)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 4dfacde05f0c..d0207b990fe4 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,7 @@
  #include <asm/vmx.h>

  #include <linux/cpu.h>
+#include <linux/protected_guest.h>

  static struct {
  	unsigned int gpa_width;
@@ -92,6 +93,9 @@ void __init tdx_early_init(void)

  	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);

+	set_protected_guest_flag(GUEST_TYPE_TDX);
+	set_protected_guest_flag(MEMORY_ENCRYPTION);
+
  	tdg_get_info();

  	pr_info("TDX guest is initialized\n");
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
new file mode 100644
index 000000000000..44e8c642654c
--- /dev/null
+++ b/include/linux/protected_guest.h
@@ -0,0 +1,37 @@
+#ifndef _LINUX_PROTECTED_GUEST_H
+#define _LINUX_PROTECTED_GUEST_H 1
+
+#define PROTECTED_GUEST_BITMAP_LEN	128
+
+/* Protected Guest vendor types */
+#define GUEST_TYPE_TDX			(1)
+#define GUEST_TYPE_SEV			(2)
+
+/* Protected Guest features */
+#define MEMORY_ENCRYPTION		(20)
+
+#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
+extern DECLARE_BITMAP(protected_guest_flags, PROTECTED_GUEST_BITMAP_LEN);
+
+static inline bool protected_guest_has(unsigned long flag)
+{
+	return test_bit(flag, protected_guest_flags);
+}
+
+static inline void set_protected_guest_flag(unsigned long flag)
+{
+	__set_bit(flag, protected_guest_flags);
+}
+
+static inline bool is_protected_guest(void)
+{
+	return protected_guest_has(GUEST_TYPE_TDX) ||
+	       protected_guest_has(GUEST_TYPE_SEV);
+}
+#else
+static inline bool protected_guest_has(unsigned long flag) { return false; }
+static inline void set_protected_guest_flag(unsigned long flag) { }
+static inline bool is_protected_guest(void) { return false; }
+#endif
+
+#endif
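For reference, the semantics of the new helpers can be sketched as a self-contained userspace mock. The flag names mirror the patch, but the open-coded bitmap helpers below are illustrative stand-ins for the kernel's test_bit()/__set_bit(), not the real implementations:

```c
#include <assert.h>
#include <stdbool.h>

#define PROTECTED_GUEST_BITMAP_LEN	128
#define BITS_PER_LONG			(8 * sizeof(unsigned long))

/* Vendor types and feature bits, as in the patch */
#define GUEST_TYPE_TDX			1
#define GUEST_TYPE_SEV			2
#define MEMORY_ENCRYPTION		20

/* Userspace stand-in for the kernel's DECLARE_BITMAP() storage */
static unsigned long protected_guest_flags[PROTECTED_GUEST_BITMAP_LEN / BITS_PER_LONG];

static void set_protected_guest_flag(unsigned long flag)
{
	protected_guest_flags[flag / BITS_PER_LONG] |= 1UL << (flag % BITS_PER_LONG);
}

static bool protected_guest_has(unsigned long flag)
{
	return protected_guest_flags[flag / BITS_PER_LONG] & (1UL << (flag % BITS_PER_LONG));
}

static bool is_protected_guest(void)
{
	return protected_guest_has(GUEST_TYPE_TDX) ||
	       protected_guest_has(GUEST_TYPE_SEV);
}
```

In the series, tdx_early_init() and init_ghcb() are the callers that set the vendor-type and MEMORY_ENCRYPTION bits; everything else only reads them.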



-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-13 16:40               ` Kuppuswamy, Sathyanarayanan
@ 2021-05-13 17:49                 ` Dave Hansen
  2021-05-13 18:17                   ` Kuppuswamy, Sathyanarayanan
  2021-05-13 19:38                   ` Andi Kleen
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-13 17:49 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Sean Christopherson
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel

On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
> 
> +#define PROTECTED_GUEST_BITMAP_LEN    128
> +
> +/* Protected Guest vendor types */
> +#define GUEST_TYPE_TDX            (1)
> +#define GUEST_TYPE_SEV            (2)
> +
> +/* Protected Guest features */
> +#define MEMORY_ENCRYPTION        (20)

I was assuming we'd reuse the X86_FEATURE infrastructure somehow.  Is
there a good reason not to?

That gives us all the compile-time optimization (via
en/disabled-features.h) and static branches for "free".
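As a rough illustration of the mechanism being referred to here (a userspace mock, not the real cpufeature/disabled-features machinery; all names and bit values below are invented):

```c
#include <assert.h>
#include <stdbool.h>

/* Invented feature bits -- not the real X86_FEATURE_* numbering */
#define FEATURE_TDX_GUEST	0
#define FEATURE_SEV		1

/*
 * Compile-time mask of features known to be configured out.  In the
 * kernel, disabled-features.h plays this role; here we pretend SEV
 * support was not compiled in.
 */
#define DISABLED_FEATURES_MASK	(1u << FEATURE_SEV)

static unsigned int runtime_features;	/* filled in during "boot" */

static inline bool mock_cpu_feature_enabled(unsigned int bit)
{
	/*
	 * For a compile-time-disabled bit this branch folds to a
	 * constant 'false', so the compiler can delete the caller's
	 * dependent code entirely -- the "for free" optimization.
	 */
	if (DISABLED_FEATURES_MASK & (1u << bit))
		return false;
	return runtime_features & (1u << bit);
}
```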

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-13 17:49                 ` Dave Hansen
@ 2021-05-13 18:17                   ` Kuppuswamy, Sathyanarayanan
  2021-05-13 19:38                   ` Andi Kleen
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-13 18:17 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel



On 5/13/21 10:49 AM, Dave Hansen wrote:
> On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
>>
>> +#define PROTECTED_GUEST_BITMAP_LEN    128
>> +
>> +/* Protected Guest vendor types */
>> +#define GUEST_TYPE_TDX            (1)
>> +#define GUEST_TYPE_SEV            (2)
>> +
>> +/* Protected Guest features */
>> +#define MEMORY_ENCRYPTION        (20)
> 
> I was assuming we'd reuse the X86_FEATURE infrastructure somehow.  Is
> there a good reason not to?

My assumption is that the protected guest abstraction can also be used
by non-x86 arches in the future. So I have tried to keep these
definitions in common code.


> 
> That gives us all the compile-time optimization (via
> en/disabled-features.h) and static branches for "free".
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-05-12 14:29           ` Dave Hansen
@ 2021-05-13 19:29             ` Kuppuswamy, Sathyanarayanan
  2021-05-13 19:33               ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-13 19:29 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov
  Cc: Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata



On 5/12/21 7:29 AM, Dave Hansen wrote:
> On 5/12/21 7:10 AM, Kuppuswamy, Sathyanarayanan wrote:
>> On 5/12/21 6:00 AM, Kirill A. Shutemov wrote:
>>> This has to be compiled only for TDX+KVM.
>>
>> Got it. So if we want to remove the "C" file include, we will have to
>> add #ifdef CONFIG_KVM_GUEST in Makefile.
>>
>> ifdef CONFIG_KVM_GUEST
>> obj-$(CONFIG_INTEL_TDX_GUEST) += tdx-kvm.o
>> endif
> 
> Is there truly no dependency between CONFIG_KVM_GUEST and
> CONFIG_INTEL_TDX_GUEST?

We want to re-use TDX code with other hypervisors/guests as well. So
we can't create a direct dependency on CONFIG_KVM_GUEST in Kconfig.

> 
> If there isn't, then the way we do it is adding another (invisible)
> Kconfig variable to express the dependency for tdx-kvm.o:
> 
> config INTEL_TDX_GUEST_KVM
> 	bool
> 	depends on KVM_GUEST && INTEL_TDX_GUEST

Currently it will only be used for the KVM hypercall code. Would it be
overkill to create a new config over #ifdefs for this use case? But
if this is the preferred approach, I will go with this suggestion.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-05-13 19:29             ` Kuppuswamy, Sathyanarayanan
@ 2021-05-13 19:33               ` Dave Hansen
  2021-05-18  0:15                 ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-13 19:33 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Kirill A. Shutemov
  Cc: Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata

On 5/13/21 12:29 PM, Kuppuswamy, Sathyanarayanan wrote:
>> If there isn't, then the way we do it is adding another (invisible)
>> Kconfig variable to express the dependency for tdx-kvm.o:
>>
>> config INTEL_TDX_GUEST_KVM
>>     bool
>>     depends on KVM_GUEST && INTEL_TDX_GUEST
> 
> Currently it will only be used for KVM hypercall code. Will it to be
> overkill to create a new config over #ifdefs for this use case ? But,
> if this is the preferred approach, I will go with this suggestion.

You'll see this done lots of different (valid) ways all over the kernel.
(#ifdef'd #including of C files is not one of them.)

*My* preference is to use Kconfig in the way I described.  It keeps
makefiles and #ifdef's clean and obvious, relegating the logic to Kconfig.
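Spelled out, that suggestion would look something like the fragment below. This is a sketch, not part of the posted series: the `def_bool y` default is an assumption about how the invisible symbol would get enabled, and the Makefile path is illustrative.

```
# arch/x86/Kconfig (sketch)
config INTEL_TDX_GUEST_KVM
	def_bool y
	depends on KVM_GUEST && INTEL_TDX_GUEST

# arch/x86/kernel/Makefile (sketch)
obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o
```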

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-13 17:49                 ` Dave Hansen
  2021-05-13 18:17                   ` Kuppuswamy, Sathyanarayanan
@ 2021-05-13 19:38                   ` Andi Kleen
  2021-05-13 19:42                     ` Dave Hansen
  2021-05-17 18:16                     ` Sean Christopherson
  1 sibling, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-13 19:38 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy, Sathyanarayanan, Sean Christopherson
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel


On 5/13/2021 10:49 AM, Dave Hansen wrote:
> On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
>> +#define PROTECTED_GUEST_BITMAP_LEN    128
>> +
>> +/* Protected Guest vendor types */
>> +#define GUEST_TYPE_TDX            (1)
>> +#define GUEST_TYPE_SEV            (2)
>> +
>> +/* Protected Guest features */
>> +#define MEMORY_ENCRYPTION        (20)
> I was assuming we'd reuse the X86_FEATURE infrastructure somehow.  Is
> there a good reason not to?


This is for generic code. It would be a gigantic lift and lots of
refactoring to move that out.

>
> That gives us all the compile-time optimization (via
> en/disabled-features.h) and static branches for "free".

There's no user so far which is anywhere near performance-critical, so
that would be total overkill.

BTW right now I'm not even sure we need the bitmap for anything, but I 
guess it doesn't hurt.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-13 19:38                   ` Andi Kleen
@ 2021-05-13 19:42                     ` Dave Hansen
  2021-05-17 18:16                     ` Sean Christopherson
  1 sibling, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-13 19:42 UTC (permalink / raw)
  To: Andi Kleen, Kuppuswamy, Sathyanarayanan, Sean Christopherson
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel

On 5/13/21 12:38 PM, Andi Kleen wrote:
> 
> On 5/13/2021 10:49 AM, Dave Hansen wrote:
>> On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
>>> +#define PROTECTED_GUEST_BITMAP_LEN    128
>>> +
>>> +/* Protected Guest vendor types */
>>> +#define GUEST_TYPE_TDX            (1)
>>> +#define GUEST_TYPE_SEV            (2)
>>> +
>>> +/* Protected Guest features */
>>> +#define MEMORY_ENCRYPTION        (20)
>> I was assuming we'd reuse the X86_FEATURE infrastructure somehow.  Is
>> there a good reason not to?
> 
> This for generic code. Would be a gigantic lift and lots of refactoring
> to move that out.

Ahh, forgot about that.  The whole "x86/mm" subject threw me off.

>> That gives us all the compile-time optimization (via
>> en/disabled-features.h) and static branches for "free".
> 
> There's no user so far which is anywhere near performance critical, so
> that would be total overkil

The *REALLY* nice thing is that it keeps you from having to create stub
functions or #ifdefs and yet the compiler can still optimize the code to
nothing.

Anyway, thanks for the clarification about it being in non-arch code.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-05-07 21:36   ` Dave Hansen
@ 2021-05-13 19:47     ` Andi Kleen
  2021-05-13 20:07       ` Dave Hansen
  2021-05-13 20:14       ` Dave Hansen
  2021-05-18  0:09     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
  1 sibling, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-13 19:47 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson


On 5/7/2021 2:36 PM, Dave Hansen wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> ...
>> The #VE cannot be nested before TDGETVEINFO is called, if there is any
>> reason for it to nest the TD would shut down. The TDX module guarantees
>> that no NMIs (or #MC or similar) can happen in this window. After
>> TDGETVEINFO the #VE handler can nest if needed, although we don’t expect
>> it to happen normally.
> I think this description really needs some work.  Does "The #VE cannot
> be nested" mean that "hardware guarantees that #VE will not be
> generated", or "the #VE must not be nested"?

The next half-sentence answers this question:

"if there is any reason for it to nest the TD would shut down."

So it cannot nest.


>
> What does "the TD would shut down" mean?  I think you mean that instead
> of delivering a nested #VE the hardware would actually exit to the host
> and TDX would prevent the guest from being reentered.  Right?


Yes, that's a shutdown. I suppose we could add your sentence.


> I find that description a bit unsatisfying.  Could we make this a bit
> more concrete?


I don't see what could be added. If you have concrete suggestions please 
just propose something.


>   By the way, what about *normal* interrupts?


Normal interrupts are blocked, of course, like in every other exception or
interrupt entry.

>
> Maybe we should talk about this in terms of *rules* that folks need to
> follow.  Maybe:
>
> 	NMIs and machine checks are suppressed.  Before this point any
> 	#VE is fatal.  After this point, NMIs and additional #VEs are
> 	permitted.

Okay that's fine for me.


-Andi




^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-05-13 19:47     ` Andi Kleen
@ 2021-05-13 20:07       ` Dave Hansen
  2021-05-13 22:43         ` Andi Kleen
  2021-05-13 20:14       ` Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-13 20:07 UTC (permalink / raw)
  To: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson

On 5/13/21 12:47 PM, Andi Kleen wrote:
> "if there is any reason for it to nest the TD would shut down."

The TDX EAS says:

> If, when attempting to inject a #VE, the Intel TDX module discovers
> that the guest TD has not yet retrieved the information for a
> previous #VE (i.e., VE_INFO.VALID is not 0), the TDX module injects a
> #DF into the guest TD to indicate a #VE overrun.

How does that result in a shut down?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-05-13 19:47     ` Andi Kleen
  2021-05-13 20:07       ` Dave Hansen
@ 2021-05-13 20:14       ` Dave Hansen
  1 sibling, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-13 20:14 UTC (permalink / raw)
  To: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson

On 5/13/21 12:47 PM, Andi Kleen wrote:
> I don't see what could be added. If you have concrete suggestions please
> just propose something.

Oh, boy, I love writing changelogs!  I was hoping that the TDX folks
would chip in to write their own changelogs, but oh well.  You made my day!

--

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either userspace or the kernel:

 * Specific instructions (WBINVD, for example)
 * Specific MSR accesses
 * Specific CPUID leaf accesses
 * Access to TD-shared memory, which includes MMIO

#VE exceptions are never generated on accesses to normal, TD-private memory.

The entry paths do not access TD-shared memory or use those specific
MSRs, instructions, CPUID leaves.  In addition, all interrupts including
NMIs are blocked by the hardware starting with #VE delivery until
TDGETVEINFO is called.  This eliminates the chance of a #VE during the
syscall gap or paranoid entry paths and simplifies #VE handling.

If a guest kernel action which would normally cause a #VE occurs in the
interrupt-disabled region before TDGETVEINFO, a #DF is delivered to the
guest.

Add basic infrastructure to handle any #VE which occurs in the kernel or
userspace.  Later patches will add handling for specific #VE scenarios.

Convert unhandled #VE's (everything, until later in this series) so that
they appear just like a #GP by calling do_general_protection() directly.

--

Did I miss anything?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-05-13 20:07       ` Dave Hansen
@ 2021-05-13 22:43         ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-13 22:43 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson


On 5/13/2021 1:07 PM, Dave Hansen wrote:
> On 5/13/21 12:47 PM, Andi Kleen wrote:
>> "if there is any reason for it to nest the TD would shut down."
> The TDX EAS says:
>
>> If, when attempting to inject a #VE, the Intel TDX module discovers
>> that the guest TD has not yet retrieved the information for a
>> previous #VE (i.e., VE_INFO.VALID is not 0), the TDX module injects a
>> #DF into the guest TD to indicate a #VE overrun.
> How does that result in a shut down?


You're right. It's not a shutdown, but a panic. We'll need to fix the
comment and replace 'shutdown' with 'panic'.


-Andi






^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-13 19:38                   ` Andi Kleen
  2021-05-13 19:42                     ` Dave Hansen
@ 2021-05-17 18:16                     ` Sean Christopherson
  2021-05-17 18:27                       ` Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-17 18:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kuppuswamy, Sathyanarayanan, Kirill A. Shutemov,
	Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel

On Thu, May 13, 2021, Andi Kleen wrote:
> 
> On 5/13/2021 10:49 AM, Dave Hansen wrote:
> > On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
> > > +#define PROTECTED_GUEST_BITMAP_LEN    128
> > > +
> > > +/* Protected Guest vendor types */
> > > +#define GUEST_TYPE_TDX            (1)
> > > +#define GUEST_TYPE_SEV            (2)
> > > +
> > > +/* Protected Guest features */
> > > +#define MEMORY_ENCRYPTION        (20)
> > I was assuming we'd reuse the X86_FEATURE infrastructure somehow.  Is
> > there a good reason not to?
> 
> This for generic code. Would be a gigantic lift and lots of refactoring to
> move that out.

What generic code needs access to SEV vs. TDX?  force_dma_unencrypted() is called
from generic code, but its implementation is x86 specific.

> > That gives us all the compile-time optimization (via
> > en/disabled-features.h) and static branches for "free".
> 
> There's no user so far which is anywhere near performance critical, so that
> would be total overkil

SEV already has the sev_enable_key static key that it uses for unrolling string
I/O, so there's at least one (debatable) case that wants to use static branches.

For SEV-ES and TDX, there's a better argument as using X86_FEATURE_* would unlock
alternatives.

> BTW right now I'm not even sure we need the bitmap for anything, but I guess
> it doesn't hurt.
> 
> -Andi
> 
> 

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-17 18:16                     ` Sean Christopherson
@ 2021-05-17 18:27                       ` Kuppuswamy, Sathyanarayanan
  2021-05-17 18:33                         ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-17 18:27 UTC (permalink / raw)
  To: Sean Christopherson, Andi Kleen
  Cc: Dave Hansen, Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel



On 5/17/21 11:16 AM, Sean Christopherson wrote:
> What generic code needs access to SEV vs. TDX?  force_dma_unencrypted() is called
> from generic code, but its implementation is x86 specific.

When hardening the drivers for TDX usage, we will have a requirement to check
for is_protected_guest() to add code specific to protected guests. Since this
will be outside arch/x86, we need a common framework for it.

A few examples:
  * The ACPI sleep driver uses WBINVD (when doing cache flushes). We want to
    skip it for TDX.
  * Forcing virtio to use the DMA API when running with an untrusted host.
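The first kind of hardening can be sketched as a tiny self-contained mock. The mock_* names below are invented stand-ins for is_protected_guest() and the wbinvd() helper; only the if/skip pattern is the point:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-ins for is_protected_guest() and wbinvd() */
static bool mock_protected_guest;
static int wbinvd_calls;

static bool mock_is_protected_guest(void) { return mock_protected_guest; }
static void mock_wbinvd(void) { wbinvd_calls++; }

/*
 * The hardening pattern: skip cache-flush instructions that would
 * trap (#VE) or be meaningless inside a protected guest.
 */
static void flush_cache_for_acpi_sleep(void)
{
	if (mock_is_protected_guest())
		return;
	mock_wbinvd();
}
```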

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-17 18:27                       ` Kuppuswamy, Sathyanarayanan
@ 2021-05-17 18:33                         ` Dave Hansen
  2021-05-17 18:37                           ` Sean Christopherson
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-17 18:33 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Sean Christopherson, Andi Kleen
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel

On 5/17/21 11:27 AM, Kuppuswamy, Sathyanarayanan wrote:
> On 5/17/21 11:16 AM, Sean Christopherson wrote:
>> What generic code needs access to SEV vs. TDX? 
>> force_dma_unencrypted() is called from generic code, but its
>> implementation is x86 specific.
> 
> When the hardening the drivers for TDX usage, we will have
> requirement to check for is_protected_guest() to add code specific to
> protected guests. Since this will be outside arch/x86, we need common
> framework for it.

Just remember, a "common framework" doesn't mean that it can't be backed
by extremely arch-specific mechanisms.

For instance, there's a lot of pkey-specific code in mm/mprotect.c.  It
still gets optimized away on x86 with all the goodness of X86_FEATUREs.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-17 18:33                         ` Dave Hansen
@ 2021-05-17 18:37                           ` Sean Christopherson
  2021-05-17 22:32                             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-17 18:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy, Sathyanarayanan, Andi Kleen, Kirill A. Shutemov,
	Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel

On Mon, May 17, 2021, Dave Hansen wrote:
> On 5/17/21 11:27 AM, Kuppuswamy, Sathyanarayanan wrote:
> > On 5/17/21 11:16 AM, Sean Christopherson wrote:
> >> What generic code needs access to SEV vs. TDX? 
> >> force_dma_unencrypted() is called from generic code, but its
> >> implementation is x86 specific.
> > 
> > When the hardening the drivers for TDX usage, we will have
> > requirement to check for is_protected_guest() to add code specific to
> > protected guests. Since this will be outside arch/x86, we need common
> > framework for it.
> 
> Just remember, a "common framework" doesn't mean that it can't be backed
> by extremely arch-specific mechanisms.
> 
> For instance, there's a lot of pkey-specific code in mm/mprotect.c.  It
> still gets optimized away on x86 with all the goodness of X86_FEATUREs.

Ya, exactly.  Ideally, generic code shouldn't have to differentiate between SEV,
SEV-ES, SEV-SNP, TDX, etc..., a vanilla "bool is_protected_guest(void)" should
suffice.  Under the hood, x86's implementation for is_protected_guest() can be
boot_cpu_has() checks (if we want).

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-17 18:37                           ` Sean Christopherson
@ 2021-05-17 22:32                             ` Kuppuswamy, Sathyanarayanan
  2021-05-17 23:11                               ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-17 22:32 UTC (permalink / raw)
  To: Sean Christopherson, Dave Hansen
  Cc: Andi Kleen, Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel



On 5/17/21 11:37 AM, Sean Christopherson wrote:
>> Just remember, a "common framework" doesn't mean that it can't be backed
>> by extremely arch-specific mechanisms.
>>
>> For instance, there's a lot of pkey-specific code in mm/mprotect.c.  It
>> still gets optimized away on x86 with all the goodness of X86_FEATUREs.
> Ya, exactly.  Ideally, generic code shouldn't have to differentiate between SEV,
> SEV-ES, SEV-SNP, TDX, etc..., a vanilla "bool is_protected_guest(void)" should
> suffice.  Under the hood, x86's implementation for is_protected_guest() can be
> boot_cpu_has() checks (if we want).

What about the use case of protected_guest_has(flag)? Do you want to call it
with X86_FEATURE_* flags outside arch/x86 code?


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-17 22:32                             ` Kuppuswamy, Sathyanarayanan
@ 2021-05-17 23:11                               ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-17 23:11 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Sean Christopherson, Dave Hansen
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel


On 5/17/2021 3:32 PM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 5/17/21 11:37 AM, Sean Christopherson wrote:
>>> Just remember, a "common framework" doesn't mean that it can't be 
>>> backed
>>> by extremely arch-specific mechanisms.
>>>
>>> For instance, there's a lot of pkey-specific code in mm/mprotect.c.  It
>>> still gets optimized away on x86 with all the goodness of X86_FEATUREs.
>> Ya, exactly.  Ideally, generic code shouldn't have to differentiate 
>> between SEV,
>> SEV-ES, SEV-SNP, TDX, etc..., a vanilla "bool 
>> is_protected_guest(void)" should
>> suffice.  Under the hood, x86's implementation for 
>> is_protected_guest() can be
>> boot_cpu_has() checks (if we want).
>
> What about the use case of protected_guest_has(flag)? Do you want to 
> call it with
> with X86_FEATURE_* flags outside arch/x86 code ?


I don't think we need any flags in the generic code. Just a simple bool 
is enough.


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT
  2021-05-12 13:51               ` Juergen Gross
@ 2021-05-17 23:50                 ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-17 23:50 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Borislav Petkov
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

CONFIG_PARAVIRT_XXL is mainly defined/used by XEN PV guests. For
other VM guest types, the features supported under CONFIG_PARAVIRT
are self-sufficient. CONFIG_PARAVIRT mainly provides support for
TLB flush operations and time-related operations.

For a TDX guest as well, the paravirt calls under CONFIG_PARAVIRT
meet most of its requirements, except for the HLT and SAFE_HLT
paravirt calls, which are currently defined under
CONFIG_PARAVIRT_XXL.

Since enabling CONFIG_PARAVIRT_XXL is too bloated for TDX-guest-like
platforms, move the HLT and SAFE_HLT paravirt calls under
CONFIG_PARAVIRT.

Moving the HLT and SAFE_HLT paravirt calls is not fatal and should not
break any functionality for current users of CONFIG_PARAVIRT.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---

Changes since v1:
 * Removed CONFIG_PARAVIRT_XL
 * Moved HLT and SAFE_HLT under CONFIG_PARAVIRT

 arch/x86/include/asm/irqflags.h       | 40 +++++++++++++++------------
 arch/x86/include/asm/paravirt.h       | 20 +++++++-------
 arch/x86/include/asm/paravirt_types.h |  3 +-
 arch/x86/kernel/paravirt.c            |  4 ++-
 4 files changed, 36 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 144d70ea4393..6671744dbf3c 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -59,6 +59,28 @@ static inline __cpuidle void native_halt(void)
 
 #endif
 
+#ifndef CONFIG_PARAVIRT
+#ifndef __ASSEMBLY__
+/*
+ * Used in the idle loop; sti takes one instruction cycle
+ * to complete:
+ */
+static inline __cpuidle void arch_safe_halt(void)
+{
+	native_safe_halt();
+}
+
+/*
+ * Used when interrupts are already enabled or to
+ * shutdown the processor:
+ */
+static inline __cpuidle void halt(void)
+{
+	native_halt();
+}
+#endif /* __ASSEMBLY__ */
+#endif /* CONFIG_PARAVIRT */
+
 #ifdef CONFIG_PARAVIRT_XXL
 #include <asm/paravirt.h>
 #else
@@ -80,24 +102,6 @@ static __always_inline void arch_local_irq_enable(void)
 	native_irq_enable();
 }
 
-/*
- * Used in the idle loop; sti takes one instruction cycle
- * to complete:
- */
-static inline __cpuidle void arch_safe_halt(void)
-{
-	native_safe_halt();
-}
-
-/*
- * Used when interrupts are already enabled or to
- * shutdown the processor:
- */
-static inline __cpuidle void halt(void)
-{
-	native_halt();
-}
-
 /*
  * For spinlocks, etc:
  */
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110e2243..5d967bce8937 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,16 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 	PVOP_VCALL1(mmu.exit_mmap, mm);
 }
 
+static inline void arch_safe_halt(void)
+{
+	PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+	PVOP_VCALL0(irq.halt);
+}
+
 #ifdef CONFIG_PARAVIRT_XXL
 static inline void load_sp0(unsigned long sp0)
 {
@@ -145,16 +155,6 @@ static inline void __write_cr4(unsigned long x)
 	PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-static inline void arch_safe_halt(void)
-{
-	PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
-	PVOP_VCALL0(irq.halt);
-}
-
 static inline void wbinvd(void)
 {
 	PVOP_VCALL0(cpu.wbinvd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index de87087d3bde..68bf35ce6dd5 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -177,10 +177,9 @@ struct pv_irq_ops {
 	struct paravirt_callee_save save_fl;
 	struct paravirt_callee_save irq_disable;
 	struct paravirt_callee_save irq_enable;
-
+#endif
 	void (*safe_halt)(void);
 	void (*halt)(void);
-#endif
 } __no_randomize_layout;
 
 struct pv_mmu_ops {
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c60222ab8ab9..b001f5aaee4a 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -322,9 +322,11 @@ struct paravirt_patch_template pv_ops = {
 	.irq.save_fl		= __PV_IS_CALLEE_SAVE(native_save_fl),
 	.irq.irq_disable	= __PV_IS_CALLEE_SAVE(native_irq_disable),
 	.irq.irq_enable		= __PV_IS_CALLEE_SAVE(native_irq_enable),
+#endif /* CONFIG_PARAVIRT_XXL */
+
+	/* Irq HLT ops. */
 	.irq.safe_halt		= native_safe_halt,
 	.irq.halt		= native_halt,
-#endif /* CONFIG_PARAVIRT_XXL */
 
 	/* Mmu ops. */
 	.mmu.flush_tlb_user	= native_flush_tlb_local,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-07 21:36   ` Dave Hansen
  2021-05-13 19:47     ` Andi Kleen
@ 2021-05-18  0:09     ` Kuppuswamy Sathyanarayanan
  2021-05-18 15:11       ` Dave Hansen
  2021-05-21 18:45       ` [RFC v2-fix " Kuppuswamy, Sathyanarayanan
  1 sibling, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18  0:09 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the kernel:

 * Specific instructions (WBINVD, for example)
 * Specific MSR accesses
 * Specific CPUID leaf accesses
 * Access to TD-shared memory, which includes MMIO

In the settings that Linux will run in, virtual exceptions are never
generated on accesses to normal, TD-private memory that has been
accepted.

The entry paths do not access TD-shared memory, MMIO regions, or use
the specific MSRs, instructions, or CPUID leaves that might generate #VE.
In addition, all interrupts including NMIs are blocked by the hardware
starting with #VE delivery until TDGETVEINFO is called.  This eliminates
the chance of a #VE during the syscall gap or paranoid entry paths and
simplifies #VE handling.

After TDGETVEINFO, a #VE could happen in theory (e.g. through an NMI),
although we don't expect it to happen because we don't expect NMIs to
trigger #VEs. Another case where one could happen is if the #VE
handler itself panics, but in that case there are no guarantees on
anything anyway.

If a guest kernel action which would normally cause a #VE occurs in the
interrupt-disabled region before TDGETVEINFO, a #DF is delivered to the
guest which will result in an oops (and should eventually be a panic, as
we would like to set panic_on_oops to 1 for TDX guests).

Add basic infrastructure to handle any #VE which occurs in the kernel or
userspace.  Later patches will add handling for specific #VE scenarios.

Convert unhandled #VEs (everything, until later in this series) so that
they appear just like a #GP, by calling ve_raise_fault() directly.
ve_raise_fault() is similar to the #GP handler and is responsible for
sending a SIGSEGV to userspace, dying in the kernel, and notifying
debuggers and other die-chain users.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since v1:
 * Removed [RFC v2 07/32] x86/traps: Add do_general_protection() helper function.
 * Instead of reusing the #GP handler, defined a custom handler.
 * Fixed commit log as per review comments.

 arch/x86/include/asm/idtentry.h |  4 ++
 arch/x86/include/asm/tdx.h      | 20 ++++++++++
 arch/x86/kernel/idt.c           |  6 +++
 arch/x86/kernel/tdx.c           | 35 +++++++++++++++++
 arch/x86/kernel/traps.c         | 70 +++++++++++++++++++++++++++++++++
 5 files changed, 135 insertions(+)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5eb3bdf36a41..41a0732d5f68 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -619,6 +619,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER,	exc_xen_hypervisor_callback);
 DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER,	exc_xen_unknown_trap);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE,		exc_virtualization_exception);
+#endif
+
 /* Device interrupts common/spurious */
 DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 1d75be21a09b..8ab4067afefc 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -11,6 +11,7 @@
 #include <linux/types.h>
 
 #define TDINFO			1
+#define TDGETVEINFO		3
 
 struct tdx_module_output {
 	u64 rcx;
@@ -29,6 +30,25 @@ struct tdx_hypercall_output {
 	u64 r15;
 };
 
+/*
+ * Used by the #VE exception handler to gather the #VE exception
+ * info from the TDX module. This is a software-only structure
+ * and not related to the TDX module/VMM.
+ */
+struct ve_info {
+	u64 exit_reason;
+	u64 exit_qual;
+	u64 gla;
+	u64 gpa;
+	u32 instr_len;
+	u32 instr_info;
+};
+
+unsigned long tdg_get_ve_info(struct ve_info *ve);
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+		struct ve_info *ve);
+
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..546b6b636c7d 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
 	 */
 	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 #endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 };
 
 /*
@@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
 	INTG(X86_TRAP_MF,		asm_exc_coprocessor_error),
 	INTG(X86_TRAP_AC,		asm_exc_alignment_check),
 	INTG(X86_TRAP_XF,		asm_exc_simd_coprocessor_error),
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 
 #ifdef CONFIG_X86_32
 	TSKG(X86_TRAP_DF,		GDT_ENTRY_DOUBLEFAULT_TSS),
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 4dfacde05f0c..b5fffbd86331 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -85,6 +85,41 @@ static void tdg_get_info(void)
 	td_info.attributes = out.rdx;
 }
 
+unsigned long tdg_get_ve_info(struct ve_info *ve)
+{
+	u64 ret;
+	struct tdx_module_output out = {0};
+
+	/*
+	 * NMIs and machine checks are suppressed. Before this point any
+	 * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
+	 * additional #VEs are permitted (but we don't expect them to
+	 * happen unless you panic).
+	 */
+	ret = __tdx_module_call(TDGETVEINFO, 0, 0, 0, 0, &out);
+
+	ve->exit_reason = out.rcx;
+	ve->exit_qual   = out.rdx;
+	ve->gla         = out.r8;
+	ve->gpa         = out.r9;
+	ve->instr_len   = out.r10 & UINT_MAX;
+	ve->instr_info  = out.r10 >> 32;
+
+	return ret;
+}
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+		struct ve_info *ve)
+{
+	/*
+	 * TODO: Add handler support for various #VE exit
+	 * reasons. It will be added by other patches in
+	 * the series.
+	 */
+	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+	return -EFAULT;
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 651e3e508959..af8efa2e57ba 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/vdso.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
@@ -1137,6 +1138,75 @@ DEFINE_IDTENTRY(exc_device_not_available)
 	}
 }
 
+#define VEFSTR "VE fault"
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+	struct task_struct *tsk = current;
+
+	if (user_mode(regs)) {
+		tsk->thread.error_code = error_code;
+		tsk->thread.trap_nr = X86_TRAP_VE;
+
+		/*
+		 * Not fixing up VDSO exceptions similar to #GP handler
+		 * because we don't expect the VDSO to trigger #VE.
+		 */
+		show_signal(tsk, SIGSEGV, "", VEFSTR, regs, error_code);
+		force_sig(SIGSEGV);
+		return;
+	}
+
+
+	if (fixup_exception(regs, X86_TRAP_VE, error_code, 0))
+		return;
+
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_VE;
+
+	/*
+	 * To be potentially processing a kprobe fault and to trust the result
+	 * from kprobe_running(), we have to be non-preemptible.
+	 */
+	if (!preemptible() &&
+	    kprobe_running() &&
+	    kprobe_fault_handler(regs, X86_TRAP_VE))
+		return;
+
+	notify_die(DIE_GPF, VEFSTR, regs, error_code, X86_TRAP_VE, SIGSEGV);
+
+	die_addr(VEFSTR, regs, error_code, 0);
+}
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+	struct ve_info ve;
+	int ret;
+
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
+	/*
+	 * NMIs/Machine-checks/Interrupts will be in a disabled state
+	 * till TDGETVEINFO TDCALL is executed. This prevents #VE
+	 * nesting issue.
+	 */
+	ret = tdg_get_ve_info(&ve);
+
+	cond_local_irq_enable(regs);
+
+	if (!ret)
+		ret = tdg_handle_virtualization_exception(regs, &ve);
+	/*
+	 * If tdg_handle_virtualization_exception() could not process
+	 * it successfully, treat it as #GP(0) and handle it.
+	 */
+	if (ret)
+		ve_raise_fault(regs, 0);
+
+	cond_local_irq_disable(regs);
+}
+#endif
+
 #ifdef CONFIG_X86_32
 DEFINE_IDTENTRY_SW(iret_error)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-13 19:33               ` Dave Hansen
@ 2021-05-18  0:15                 ` Kuppuswamy Sathyanarayanan
  2021-05-18 15:51                   ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18  0:15 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Isaku Yamahata,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

KVM hypercalls use the "vmcall" or "vmmcall" instructions.
Although the ABI is similar, those instructions no longer
function for TDX guests. Make vendor-specific TDVMCALLs
instead of VMCALL.

[Isaku: proposed KVM VENDOR string]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
Changes since RFC v2:
 * Introduced INTEL_TDX_GUEST_KVM config for TDX+KVM related changes.
 * Removed "C" include file.
 * Fixed commit log as per Dave's comments.

 arch/x86/Kconfig                |  6 +++++
 arch/x86/include/asm/kvm_para.h | 21 +++++++++++++++
 arch/x86/include/asm/tdx.h      | 41 ++++++++++++++++++++++++++++
 arch/x86/kernel/Makefile        |  1 +
 arch/x86/kernel/tdcall.S        | 20 ++++++++++++++
 arch/x86/kernel/tdx-kvm.c       | 48 +++++++++++++++++++++++++++++++++
 6 files changed, 137 insertions(+)
 create mode 100644 arch/x86/kernel/tdx-kvm.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9e0e0ff76bab..768df1b98487 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,6 +886,12 @@ config INTEL_TDX_GUEST
 	  run in a CPU mode that protects the confidentiality of TD memory
 	  contents and the TD’s CPU state from other software, including VMM.
 
+config INTEL_TDX_GUEST_KVM
+	def_bool y
+	depends on KVM_GUEST && INTEL_TDX_GUEST
+	help
+	  This option enables KVM-specific hypercalls in a TDX guest.
+
 endif #HYPERVISOR_GUEST
 
 source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
 #include <asm/alternative.h>
 #include <linux/interrupt.h>
 #include <uapi/asm/kvm_para.h>
+#include <asm/tdx.h>
 
 extern void kvmclock_init(void);
 
@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 static inline long kvm_hypercall0(unsigned int nr)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall0(nr);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall1(nr, p1);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
 				  unsigned long p2)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall2(nr, p1, p2);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
 				  unsigned long p2, unsigned long p3)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 				  unsigned long p4)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8ab4067afefc..eb758b506dba 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -73,4 +73,45 @@ static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
+			       u64 r15, struct tdx_hypercall_output *out);
+long tdx_kvm_hypercall0(unsigned int nr);
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1);
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2);
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3);
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4);
+#else
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+				      unsigned long p2)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3,
+				      unsigned long p4)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
+
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 7966c10ea8d1..a90fec004844 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -128,6 +128,7 @@ obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
 obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index a484c4aef6e6..3c57a1d67b79 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -25,6 +25,8 @@
 					  TDG_R12 | TDG_R13 | \
 					  TDG_R14 | TDG_R15 )
 
+#define TDVMCALL_VENDOR_KVM		0x4d564b2e584454 /* "TDX.KVM" */
+
 /*
  * TDX guests use the TDCALL instruction to make requests to the
  * TDX module and hypercalls to the VMM. It is supported in
@@ -213,3 +215,21 @@ SYM_FUNC_START(__tdx_hypercall)
 	call do_tdx_hypercall
 	retq
 SYM_FUNC_END(__tdx_hypercall)
+
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+/*
+ * Helper function for KVM vendor TDVMCALLs. This assembly wrapper
+ * lets us reuse do_tdx_hypercall() for KVM-specific hypercalls
+ * (TDVMCALL_VENDOR_KVM).
+ */
+SYM_FUNC_START(__tdx_hypercall_vendor_kvm)
+	/*
+	 * R10 is not part of the function call ABI, but it is a part
+	 * of the TDVMCALL ABI. So set it before making call to the
+	 * do_tdx_hypercall().
+	 */
+	movq $TDVMCALL_VENDOR_KVM, %r10
+	call do_tdx_hypercall
+	retq
+SYM_FUNC_END(__tdx_hypercall_vendor_kvm)
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
diff --git a/arch/x86/kernel/tdx-kvm.c b/arch/x86/kernel/tdx-kvm.c
new file mode 100644
index 000000000000..b21453a81e38
--- /dev/null
+++ b/arch/x86/kernel/tdx-kvm.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2020 Intel Corporation */
+
+#include <asm/tdx.h>
+
+static long tdx_kvm_hypercall(unsigned int fn, unsigned long r12,
+			      unsigned long r13, unsigned long r14,
+			      unsigned long r15)
+{
+	return __tdx_hypercall_vendor_kvm(fn, r12, r13, r14, r15, NULL);
+}
+
+/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall0);
+
+/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return tdx_kvm_hypercall(nr, p1, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall1);
+
+/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2)
+{
+	return tdx_kvm_hypercall(nr, p1, p2, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall2);
+
+/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3)
+{
+	return tdx_kvm_hypercall(nr, p1, p2, p3, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall3);
+
+/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4)
+{
+	return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-07 21:52   ` Dave Hansen
@ 2021-05-18  0:48     ` Kuppuswamy Sathyanarayanan
  2021-05-18 15:00       ` Dave Hansen
  2021-06-02 19:42     ` [RFC v2-fix-v2 0/2] " Kuppuswamy Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18  0:48 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

In traditional VMs, MMIO tends to be implemented by giving a
guest access to a mapping which will cause a VMEXIT on access.
That's not possible in a TDX guest, so use #VE to implement MMIO
support instead. In a TDX guest, MMIO triggers a #VE with the
EPT_VIOLATION exit reason.

For now we only handle a subset of instructions that the kernel
uses for MMIO operations. User-space access triggers SIGBUS.

The reasons for supporting #VE-based MMIO in a TDX guest are:

* MMIO is widely used and we'll have more drivers in the future.
* We don't want to annotate every TDX specific MMIO readl/writel etc.
* If we didn't annotate we would need to add an alternative to every
  MMIO access in the kernel (even though 99.9% will never be used on
  TDX) which would be a complete waste and incredible binary bloat
  for nothing.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
 * Fixed commit log as per Dave's review.

 arch/x86/kernel/tdx.c | 100 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index b9e3010987e0..9330c7a9ad69 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -5,6 +5,8 @@
 
 #include <asm/tdx.h>
 #include <asm/vmx.h>
+#include <asm/insn.h>
+#include <linux/sched/signal.h> /* force_sig_fault() */
 
 #include <linux/cpu.h>
 #include <linux/protected_guest.h>
@@ -209,6 +211,101 @@ static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
 	}
 }
 
+static unsigned long tdg_mmio(int size, bool write, unsigned long addr,
+		unsigned long val)
+{
+	return tdx_hypercall_out_r11(EXIT_REASON_EPT_VIOLATION, size,
+				     write, addr, val);
+}
+
+static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn)
+{
+	static const int regoff[] = {
+		offsetof(struct pt_regs, ax),
+		offsetof(struct pt_regs, cx),
+		offsetof(struct pt_regs, dx),
+		offsetof(struct pt_regs, bx),
+		offsetof(struct pt_regs, sp),
+		offsetof(struct pt_regs, bp),
+		offsetof(struct pt_regs, si),
+		offsetof(struct pt_regs, di),
+		offsetof(struct pt_regs, r8),
+		offsetof(struct pt_regs, r9),
+		offsetof(struct pt_regs, r10),
+		offsetof(struct pt_regs, r11),
+		offsetof(struct pt_regs, r12),
+		offsetof(struct pt_regs, r13),
+		offsetof(struct pt_regs, r14),
+		offsetof(struct pt_regs, r15),
+	};
+	int regno;
+
+	regno = X86_MODRM_REG(insn->modrm.value);
+	if (X86_REX_R(insn->rex_prefix.value))
+		regno += 8;
+
+	return (void *)regs + regoff[regno];
+}
+
+static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+	int size;
+	bool write;
+	unsigned long *reg;
+	struct insn insn;
+	unsigned long val = 0;
+
+	/*
+	 * User mode would mean the kernel exposed a device directly
+	 * to ring3, which shouldn't happen except for things like
+	 * DPDK.
+	 */
+	if (user_mode(regs)) {
+		pr_err("Unexpected user-mode MMIO access.\n");
+		force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
+		return 0;
+	}
+
+	kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
+	insn_get_length(&insn);
+	insn_get_opcode(&insn);
+
+	write = ve->exit_qual & 0x2;
+
+	size = insn.opnd_bytes;
+	switch (insn.opcode.bytes[0]) {
+	/* MOV r/m8	r8	*/
+	case 0x88:
+	/* MOV r8	r/m8	*/
+	case 0x8A:
+	/* MOV r/m8	imm8	*/
+	case 0xC6:
+		size = 1;
+		break;
+	}
+
+	if (inat_has_immediate(insn.attr)) {
+		BUG_ON(!write);
+		val = insn.immediate.value;
+		tdg_mmio(size, write, ve->gpa, val);
+		return insn.length;
+	}
+
+	BUG_ON(!inat_has_modrm(insn.attr));
+
+	reg = get_reg_ptr(regs, &insn);
+
+	if (write) {
+		memcpy(&val, reg, size);
+		tdg_mmio(size, write, ve->gpa, val);
+	} else {
+		val = tdg_mmio(size, write, ve->gpa, val);
+		memset(reg, 0, size);
+		memcpy(reg, &val, size);
+	}
+	return insn.length;
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -258,6 +355,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_IO_INSTRUCTION:
 		tdg_handle_io(regs, ve->exit_qual);
 		break;
+	case EXIT_REASON_EPT_VIOLATION:
+		ve->instr_len = tdg_handle_mmio(regs, ve);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-05-13  2:56   ` Dan Williams
@ 2021-05-18  0:54     ` Kuppuswamy Sathyanarayanan
  2021-05-18  2:06       ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18  0:54 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson,
	Kai Huang, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a trampoline for booting APs in 64-bit mode via a software handoff
with BIOS, and use the new trampoline for the ACPI MP wake protocol used
by TDX. You can find MADT MP wake protocol details in ACPI specification
r6.4, sec 5.2.12.19.

Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
mode.  For the GDT pointer, create a new entry as the existing storage
for the pointer occupies the zero entry in the GDT itself.

Reported-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
 * Removed X86_CR0_NE and EFER related changes from this changes
   and moved it to patch titled "x86/boot: Avoid #VE during
   boot for TDX platforms"
 * Fixed commit log as per Dan's suggestion.
 * Added inline get_trampoline_start_ip() to set start_ip.

 arch/x86/boot/compressed/pgtable.h       |  2 +-
 arch/x86/include/asm/realmode.h          | 10 +++++++
 arch/x86/kernel/smpboot.c                |  2 +-
 arch/x86/realmode/rm/header.S            |  1 +
 arch/x86/realmode/rm/trampoline_64.S     | 38 ++++++++++++++++++++++++
 arch/x86/realmode/rm/trampoline_common.S |  7 ++++-
 6 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE	0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
 
 #define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
 
diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 5db5d083c873..3328c8edb200 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
 	u32	sev_es_trampoline_start;
 #endif
 #ifdef CONFIG_X86_64
+	u32	trampoline_start64;
 	u32	trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
@@ -88,6 +89,15 @@ static inline void set_real_mode_mem(phys_addr_t mem)
 	real_mode_header = (struct real_mode_header *) __va(mem);
 }
 
+static inline unsigned long get_trampoline_start_ip(void)
+{
+#ifdef CONFIG_X86_64
+        if (is_tdx_guest())
+                return real_mode_header->trampoline_start64;
+#endif
+	return real_mode_header->trampoline_start;
+}
+
 void reserve_real_mode(void);
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 16703c35a944..0b4dff5e67a9 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1031,7 +1031,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
 		       int *cpu0_nmi_registered)
 {
 	/* start_ip had better be page-aligned! */
-	unsigned long start_ip = real_mode_header->trampoline_start;
+	unsigned long start_ip = get_trampoline_start_ip();
 
 	unsigned long boot_error = 0;
 	unsigned long timeout;
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
 	.long	pa_sev_es_trampoline_start
 #endif
 #ifdef CONFIG_X86_64
+	.long	pa_trampoline_start64
 	.long	pa_trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 84c5d1b33d10..754f8d2ac9e8 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
 	ljmpl	$__KERNEL_CS, $pa_startup_64
 SYM_CODE_END(startup_32)
 
+SYM_CODE_START(pa_trampoline_compat)
+	/*
+	 * In compatibility mode.  Prep ESP and DX for startup_32, then disable
+	 * paging and complete the switch to legacy 32-bit mode.
+	 */
+	movl	$rm_stack_end, %esp
+	movw	$__KERNEL_DS, %dx
+
+	movl	$(X86_CR0_NE | X86_CR0_PE), %eax
+	movl	%eax, %cr0
+	ljmpl   $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
 	.section ".text64","ax"
 	.code64
 	.balign 4
@@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
 	jmpq	*tr_start(%rip)
 SYM_CODE_END(startup_64)
 
+SYM_CODE_START(trampoline_start64)
+	/*
+	 * APs start here on a direct transfer from 64-bit BIOS with identity
+	 * mapped page tables.  Load the kernel's GDT in order to gear down to
+	 * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+	 * segment registers.  Load the zero IDT so any fault triggers a
+	 * shutdown instead of jumping back into BIOS.
+	 */
+	lidt	tr_idt(%rip)
+	lgdt	tr_gdt64(%rip)
+
+	ljmpl	*tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
 	.section ".rodata","a"
 	# Duplicate the global descriptor table
 	# so the kernel can live anywhere
@@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
 	.quad	0x00cf93000000ffff	# __KERNEL_DS
 SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
 
+SYM_DATA_START(tr_gdt64)
+	.short	tr_gdt_end - tr_gdt - 1	# gdt limit
+	.long	pa_tr_gdt
+	.long	0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+	.long	pa_trampoline_compat
+	.short	__KERNEL32_CS
+SYM_DATA_END(tr_compat)
+
 	.bss
 	.balign	PAGE_SIZE
 SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 5033e640f957..ade7db208e4e 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -1,4 +1,9 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 	.section ".rodata","a"
 	.balign	16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+
+/* .fill cannot be used for size > 8. So use short and quad */
+SYM_DATA_START_LOCAL(tr_idt)
+	.short  0
+	.quad   0
+SYM_DATA_END(tr_idt)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/boot: Avoid #VE during boot for TDX platforms
  2021-05-13  3:23   ` Dan Williams
@ 2021-05-18  0:59     ` Kuppuswamy Sathyanarayanan
  2021-05-19 16:53       ` [RFC " Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18  0:59 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson,
	Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Avoid operations which will inject #VE during boot process,
which is obviously fatal for TDX platforms.

Details are,

1. TDX module injects #VE if a TDX guest attempts to write
   EFER.
   
   Boot code updates EFER in following cases:
   
   * When enabling Long Mode configuration, EFER.LME bit will
     be set. Since TDX forces EFER.LME=1, we can skip updating
     it again. Check for EFER.LME before updating it and skip
     it if it is already set.

   * EFER is also updated to enable support for features like
     System call and No Execute page setting. In TDX, these
     features are set up by the TDX module. So check whether
     it is already enabled, and skip enabling it again.
   
2. TDX module also injects a #VE if the guest attempts to clear
   CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed
   boot. Setting CR0.NE should be a no-op on all CPUs that
   support 64-bit mode.
   
3. The TDX-Module (effectively part of the hypervisor) requires
   CR4.MCE to be set at all times and injects a #VE if the guest
   attempts to clear CR4.MCE. So, preserve CR4.MCE instead of
   clearing it during boot to avoid #VE.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
 * Merged Avoid #VE related changes together.
   * [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot
     for TDX platforms
   * [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process.
 * Fixed commit log as per review comments.

 arch/x86/boot/compressed/head_64.S   | 10 +++++++---
 arch/x86/kernel/head_64.S            | 13 +++++++++++--
 arch/x86/realmode/rm/trampoline_64.S | 11 +++++++++--
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..2d79e5f97360 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,12 +616,16 @@ SYM_CODE_START(trampoline_32bit_src)
 	movl	$MSR_EFER, %ecx
 	rdmsr
 	btsl	$_EFER_LME, %eax
+	jc	1f
 	wrmsr
-	popl	%edx
+1:	popl	%edx
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
-	movl	$X86_CR4_PAE, %eax
+	movl	%cr4, %eax
+	/* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
+	andl	$X86_CR4_MCE, %eax
+	orl	$X86_CR4_PAE, %eax
 	testl	%edx, %edx
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
@@ -636,7 +640,7 @@ SYM_CODE_START(trampoline_32bit_src)
 	pushl	%eax
 
 	/* Enable paging again */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
+	movl	$(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	lret
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..92c77cf75542 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 1:
 
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	movq	%cr4, %rcx
+	/* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
+	andl	$X86_CR4_MCE, %ecx
+	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
@@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	/* Setup EFER (Extended Feature Enable Register) */
 	movl	$MSR_EFER, %ecx
 	rdmsr
+	movl    %eax, %edx
 	btsl	$_EFER_SCE, %eax	/* Enable System Call */
 	btl	$20,%edi		/* No Execute supported? */
 	jnc     1f
 	btsl	$_EFER_NX, %eax
 	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1:	wrmsr				/* Make changes effective */
 
+	/* Skip the WRMSR if the current value matches the desired value. */
+1:	cmpl	%edx, %eax
+	je	1f
+	xor	%edx, %edx
+	wrmsr				/* Make changes effective */
+1:
 	/* Setup cr0 */
 	movl	$CR0_STATE, %eax
 	/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 754f8d2ac9e8..12b734b1da8b 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,20 @@ SYM_CODE_START(startup_32)
 	movl	%eax, %cr3
 
 	# Set up EFER
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	cmp	pa_tr_efer, %eax
+	jne	.Lwrite_efer
+	cmp	pa_tr_efer + 4, %edx
+	je	.Ldone_efer
+.Lwrite_efer:
 	movl	pa_tr_efer, %eax
 	movl	pa_tr_efer + 4, %edx
-	movl	$MSR_EFER, %ecx
 	wrmsr
 
+.Ldone_efer:
 	# Enable paging and in turn activate Long Mode
-	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/tdx: Make DMA pages shared
  2021-04-26 18:01 ` [RFC v2 30/32] x86/tdx: Make DMA pages shared Kuppuswamy Sathyanarayanan
@ 2021-05-18  1:19   ` Kuppuswamy Sathyanarayanan
  2021-05-18 19:55     ` Sean Christopherson
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18  1:19 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kai Huang,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Intel TDX doesn't allow the VMM to access guest memory. Any memory
that is required for communication with the VMM must be shared
explicitly by setting the shared bit in the page table entry. After
setting the shared bit, the conversion must be completed with the
MapGPA TDVMCALL. The call informs the VMM about the conversion and
makes it remove the GPA from the S-EPT mapping. The shared memory
is similar to unencrypted memory in AMD SME/SEV terminology, but
the underlying process of sharing/un-sharing the memory is
different for an Intel TDX guest.

SEV assumes that I/O devices can only do DMA to "decrypted"
physical addresses, i.e. without the C-bit set.  In order for the
CPU to interact with this memory, the CPU needs a decrypted
mapping. To support this, the AMD SME code forces
force_dma_unencrypted() to return true on platforms that support
AMD SEV. The DMA memory allocation API uses it to trigger
set_memory_decrypted() on those platforms.

TDX is similar.  TDX architecturally prevents access to private
guest memory by anything other than the guest itself. This means
that any DMA buffers must be shared.

So create a new file, mem_encrypt_tdx.c, to hold TDX-specific
memory initialization code, and re-define force_dma_unencrypted()
for TDX guests to return true so that DMA pages get mapped as
shared.

__set_memory_enc_dec() is now aware of TDX and sets the Shared
bit accordingly, following up with the relevant TDVMCALL.

Also, do TDACCEPTPAGE on every 4k page after mapping the GPA range when
converting memory to private.  If the VMM uses a common pool for private
and shared memory, it will likely do TDAUGPAGE in response to MapGPA
(or on the first access to the private GPA), in which case the TDX module
will hold the page in a non-present "pending" state until it is explicitly
accepted.

BUG() if TDACCEPTPAGE fails (except in the above case), as the guest is
completely hosed if it can't access memory.

Tested-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
 * Since the common code between AMD-SEV and TDX is very minimal,
   defining a new config (X86_MEM_ENCRYPT_COMMON) for common code
   is not very useful. So created a separate file for Intel TDX
   specific memory initialization (similar to AMD SEV).
 * Removed patch titled "x86/mm: Move force_dma_unencrypted() to
   common code" from this series. And merged required changes in
   this patch.

 arch/x86/Kconfig              |  1 +
 arch/x86/include/asm/tdx.h    |  3 +++
 arch/x86/kernel/tdx.c         | 26 ++++++++++++++++++-
 arch/x86/mm/Makefile          |  1 +
 arch/x86/mm/mem_encrypt_tdx.c | 19 ++++++++++++++
 arch/x86/mm/pat/set_memory.c  | 48 +++++++++++++++++++++++++++++------
 6 files changed, 89 insertions(+), 9 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_tdx.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a055594e2664..69a98bcdc07a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -879,6 +879,7 @@ config INTEL_TDX_GUEST
 	select X86_X2APIC
 	select SECURITY_LOCKDOWN_LSM
 	select ARCH_HAS_PROTECTED_GUEST
+	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
 	select DYNAMIC_PHYSICAL_MASK
 	help
 	  Provide support for running in a trusted domain on Intel processors
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f5e8088dabc5..4ad436cc2146 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -19,6 +19,9 @@ enum tdx_map_type {
 
 #define TDINFO			1
 #define TDGETVEINFO		3
+#define TDACCEPTPAGE		6
+
+#define TDX_PAGE_ALREADY_ACCEPTED	0x8000000000000001
 
 struct tdx_module_output {
 	u64 rcx;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 9ddb80adc034..caf8e4c5ddbc 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -100,7 +100,8 @@ static void tdg_get_info(void)
 	physical_mask &= ~tdg_shared_mask();
 }
 
-int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+static int __tdg_map_gpa(phys_addr_t gpa, int numpages,
+			 enum tdx_map_type map_type)
 {
 	u64 ret;
 
@@ -111,6 +112,29 @@ int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
 	return ret ? -EIO : 0;
 }
 
+static void tdg_accept_page(phys_addr_t gpa)
+{
+	u64 ret;
+
+	ret = __tdx_module_call(TDACCEPTPAGE, gpa, 0, 0, 0, NULL);
+
+	BUG_ON(ret && ret != TDX_PAGE_ALREADY_ACCEPTED);
+}
+
+int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+{
+	int ret, i;
+
+	ret = __tdg_map_gpa(gpa, numpages, map_type);
+	if (ret || map_type == TDX_MAP_SHARED)
+		return ret;
+
+	for (i = 0; i < numpages; i++)
+		tdg_accept_page(gpa + i*PAGE_SIZE);
+
+	return 0;
+}
+
 static __cpuidle void tdg_halt(void)
 {
 	u64 ret;
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..555dcc0cd087 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -55,3 +55,4 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= mem_encrypt_tdx.o
diff --git a/arch/x86/mm/mem_encrypt_tdx.c b/arch/x86/mm/mem_encrypt_tdx.c
new file mode 100644
index 000000000000..f394a43bf46d
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_tdx.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Intel TDX Memory Encryption Support
+ *
+ * Copyright (C) 2020 Intel Corporation
+ *
+ * Author: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/dma-mapping.h>
+
+#include <asm/tdx.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+	return is_tdx_guest();
+}
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..ea78c7907847 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -27,6 +27,7 @@
 #include <asm/proto.h>
 #include <asm/memtype.h>
 #include <asm/set_memory.h>
+#include <asm/tdx.h>
 
 #include "../mm_internal.h"
 
@@ -1972,13 +1973,15 @@ int set_memory_global(unsigned long addr, int numpages)
 				    __pgprot(_PAGE_GLOBAL), 0);
 }
 
-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
 {
+	pgprot_t mem_protected_bits, mem_plain_bits;
 	struct cpa_data cpa;
+	enum tdx_map_type map_type;
 	int ret;
 
-	/* Nothing to do if memory encryption is not active */
-	if (!mem_encrypt_active())
+	/* Nothing to do if memory encryption and TDX are not active */
+	if (!mem_encrypt_active() && !is_tdx_guest())
 		return 0;
 
 	/* Should not be working on unaligned addresses */
@@ -1988,8 +1991,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	memset(&cpa, 0, sizeof(cpa));
 	cpa.vaddr = &addr;
 	cpa.numpages = numpages;
-	cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
-	cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+	if (is_tdx_guest()) {
+		mem_protected_bits = __pgprot(0);
+		mem_plain_bits = __pgprot(tdg_shared_mask());
+	} else {
+		mem_protected_bits = __pgprot(_PAGE_ENC);
+		mem_plain_bits = __pgprot(0);
+	}
+
+	if (protect) {
+		cpa.mask_set = mem_protected_bits;
+		cpa.mask_clr = mem_plain_bits;
+		map_type = TDX_MAP_PRIVATE;
+	} else {
+		cpa.mask_set = mem_plain_bits;
+		cpa.mask_clr = mem_protected_bits;
+		map_type = TDX_MAP_SHARED;
+	}
+
 	cpa.pgd = init_mm.pgd;
 
 	/* Must avoid aliasing mappings in the highmem code */
@@ -1998,8 +2018,16 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 
 	/*
 	 * Before changing the encryption attribute, we need to flush caches.
+	 *
+	 * For TDX we need to flush caches on private->shared. VMM is
+	 * responsible for flushing on shared->private.
 	 */
-	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	if (is_tdx_guest()) {
+		if (map_type == TDX_MAP_SHARED)
+			cpa_flush(&cpa, 1);
+	} else {
+		cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	}
 
 	ret = __change_page_attr_set_clr(&cpa, 1);
 
@@ -2012,18 +2040,22 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	 */
 	cpa_flush(&cpa, 0);
 
+	if (!ret && is_tdx_guest()) {
+		ret = tdg_map_gpa(__pa(addr), numpages, map_type);
+	}
+
 	return ret;
 }
 
 int set_memory_encrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, true);
+	return __set_memory_protect(addr, numpages, true);
 }
 EXPORT_SYMBOL_GPL(set_memory_encrypted);
 
 int set_memory_decrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, false);
+	return __set_memory_protect(addr, numpages, false);
 }
 EXPORT_SYMBOL_GPL(set_memory_decrypted);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-12 15:44           ` Dave Hansen
  2021-05-12 15:53             ` Sean Christopherson
@ 2021-05-18  1:28             ` Kuppuswamy, Sathyanarayanan
  2021-05-27  4:46               ` Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-18  1:28 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov
  Cc: Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 5/12/21 8:44 AM, Dave Hansen wrote:
> Because the code is already separate.  You're actually going to some
> trouble to move the SEV-specific code and then combine it with the
> TDX-specific code.
> 
> Anyway, please just give it a shot.  Should take all of ten minutes.  If
> it doesn't work out in practice, fine.  You'll have a good paragraph for
> the changelog.

After reviewing the code again, I have noticed that we don't really have
much common code between AMD and TDX. So I don't see any justification for
creating this common layer. So, I have decided to drop this patch and move
Intel TDX specific memory encryption init code to patch titled "[RFC v2 30/32]
x86/tdx: Make DMA pages shared". This model is similar to how AMD-SEV
does the initialization.

I have sent the modified patch as reply to patch titled "[RFC v2 30/32]
x86/tdx: Make DMA pages shared". Please check and let me know your comments.
-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-05-18  0:54     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-05-18  2:06       ` Dan Williams
  2021-05-18  2:53         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-18  2:06 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Sean Christopherson, Kai Huang

I notice that you have [RFC v2-fix 1/1] as the prefix for this patch.
b4 recently gained support for partial series re-rolls [1], but I
think you would need to bump the version number [RFC PATCH v3 21/32]
and maintain the patch numbering. In this case with changes moving
between patches, and those other patches being squashed any chance of
automated reconstruction of this series is likely lost.

Just wanted to note that for future reference in case you were hoping
to avoid resending full series in the future. For now, some more
comments below:

[1]: https://lore.kernel.org/tools/20210517161317.teawoh5qovxpmqdc@nitro.local/

On Mon, May 17, 2021 at 5:54 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Add a trampoline for booting APs in 64-bit mode via a software handoff
> with BIOS, and use the new trampoline for the ACPI MP wake protocol used
> by TDX. You can find MADT MP wake protocol details in ACPI specification
> r6.4, sec 5.2.12.19.
>
> Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
> mode.  For the GDT pointer, create a new entry as the existing storage
> for the pointer occupies the zero entry in the GDT itself.
>
> Reported-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>
> Changes since RFC v2:
>  * Removed X86_CR0_NE and EFER related changes from this change

This was only partially done, see below...

>    and moved it to patch titled "x86/boot: Avoid #VE during
>    boot for TDX platforms"
>  * Fixed commit log as per Dan's suggestion.
>  * Added inline get_trampoline_start_ip() to set start_ip.

You also added a comment to tr_idt, but didn't mention it here, so I
went to double check. Please take care to document all changes to the
patch from the previous review.

>
>  arch/x86/boot/compressed/pgtable.h       |  2 +-
>  arch/x86/include/asm/realmode.h          | 10 +++++++
>  arch/x86/kernel/smpboot.c                |  2 +-
>  arch/x86/realmode/rm/header.S            |  1 +
>  arch/x86/realmode/rm/trampoline_64.S     | 38 ++++++++++++++++++++++++
>  arch/x86/realmode/rm/trampoline_common.S |  7 ++++-
>  6 files changed, 57 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
> index 6ff7e81b5628..cc9b2529a086 100644
> --- a/arch/x86/boot/compressed/pgtable.h
> +++ b/arch/x86/boot/compressed/pgtable.h
> @@ -6,7 +6,7 @@
>  #define TRAMPOLINE_32BIT_PGTABLE_OFFSET        0
>
>  #define TRAMPOLINE_32BIT_CODE_OFFSET   PAGE_SIZE
> -#define TRAMPOLINE_32BIT_CODE_SIZE     0x70
> +#define TRAMPOLINE_32BIT_CODE_SIZE     0x80
>
>  #define TRAMPOLINE_32BIT_STACK_END     TRAMPOLINE_32BIT_SIZE
>
> diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
> index 5db5d083c873..3328c8edb200 100644
> --- a/arch/x86/include/asm/realmode.h
> +++ b/arch/x86/include/asm/realmode.h
> @@ -25,6 +25,7 @@ struct real_mode_header {
>         u32     sev_es_trampoline_start;
>  #endif
>  #ifdef CONFIG_X86_64
> +       u32     trampoline_start64;
>         u32     trampoline_pgd;
>  #endif
>         /* ACPI S3 wakeup */
> @@ -88,6 +89,15 @@ static inline void set_real_mode_mem(phys_addr_t mem)
>         real_mode_header = (struct real_mode_header *) __va(mem);
>  }
>
> +static inline unsigned long get_trampoline_start_ip(void)

I'd prefer this helper take a 'struct real_mode_header *rmh' as an
argument rather than assume a global variable.

> +{
> +#ifdef CONFIG_X86_64
> +        if (is_tdx_guest())
> +                return real_mode_header->trampoline_start64;
> +#endif
> +       return real_mode_header->trampoline_start;
> +}
> +
>  void reserve_real_mode(void);
>
>  #endif /* __ASSEMBLY__ */
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 16703c35a944..0b4dff5e67a9 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1031,7 +1031,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
>                        int *cpu0_nmi_registered)
>  {
>         /* start_ip had better be page-aligned! */
> -       unsigned long start_ip = real_mode_header->trampoline_start;
> +       unsigned long start_ip = get_trampoline_start_ip();
>
>         unsigned long boot_error = 0;
>         unsigned long timeout;
> diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
> index 8c1db5bf5d78..2eb62be6d256 100644
> --- a/arch/x86/realmode/rm/header.S
> +++ b/arch/x86/realmode/rm/header.S
> @@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
>         .long   pa_sev_es_trampoline_start
>  #endif
>  #ifdef CONFIG_X86_64
> +       .long   pa_trampoline_start64
>         .long   pa_trampoline_pgd;
>  #endif
>         /* ACPI S3 wakeup */
> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
> index 84c5d1b33d10..754f8d2ac9e8 100644
> --- a/arch/x86/realmode/rm/trampoline_64.S
> +++ b/arch/x86/realmode/rm/trampoline_64.S
> @@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
>         ljmpl   $__KERNEL_CS, $pa_startup_64
>  SYM_CODE_END(startup_32)
>
> +SYM_CODE_START(pa_trampoline_compat)
> +       /*
> +        * In compatibility mode.  Prep ESP and DX for startup_32, then disable
> +        * paging and complete the switch to legacy 32-bit mode.
> +        */
> +       movl    $rm_stack_end, %esp
> +       movw    $__KERNEL_DS, %dx
> +
> +       movl    $(X86_CR0_NE | X86_CR0_PE), %eax

Before this patch the startup path did not touch X86_CR0_NE. I assume
it was added opportunistically for the TDX case? If it is to stay in
this patch it deserves a code comment / mention in the changelog, or
it needs to move to the other patch that fixes up the CR0 setup for
TDX.


> +       movl    %eax, %cr0
> +       ljmpl   $__KERNEL32_CS, $pa_startup_32
> +SYM_CODE_END(pa_trampoline_compat)
> +
>         .section ".text64","ax"
>         .code64
>         .balign 4
> @@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
>         jmpq    *tr_start(%rip)
>  SYM_CODE_END(startup_64)
>
> +SYM_CODE_START(trampoline_start64)
> +       /*
> +        * APs start here on a direct transfer from 64-bit BIOS with identity
> +        * mapped page tables.  Load the kernel's GDT in order to gear down to
> +        * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
> +        * segment registers.  Load the zero IDT so any fault triggers a
> +        * shutdown instead of jumping back into BIOS.
> +        */
> +       lidt    tr_idt(%rip)
> +       lgdt    tr_gdt64(%rip)
> +
> +       ljmpl   *tr_compat(%rip)
> +SYM_CODE_END(trampoline_start64)
> +
>         .section ".rodata","a"
>         # Duplicate the global descriptor table
>         # so the kernel can live anywhere
> @@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
>         .quad   0x00cf93000000ffff      # __KERNEL_DS
>  SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
>
> +SYM_DATA_START(tr_gdt64)
> +       .short  tr_gdt_end - tr_gdt - 1 # gdt limit
> +       .long   pa_tr_gdt
> +       .long   0
> +SYM_DATA_END(tr_gdt64)
> +
> +SYM_DATA_START(tr_compat)
> +       .long   pa_trampoline_compat
> +       .short  __KERNEL32_CS
> +SYM_DATA_END(tr_compat)
> +
>         .bss
>         .balign PAGE_SIZE
>  SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
> diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
> index 5033e640f957..ade7db208e4e 100644
> --- a/arch/x86/realmode/rm/trampoline_common.S
> +++ b/arch/x86/realmode/rm/trampoline_common.S
> @@ -1,4 +1,9 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>         .section ".rodata","a"
>         .balign 16
> -SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
> +
> +/* .fill cannot be used for size > 8. So use short and quad */

If there is to be a comment here it should be to clarify why @tr_idt
is 10 bytes, not necessarily a quirk of the assembler.

> +SYM_DATA_START_LOCAL(tr_idt)

The .fill restriction is only for @size, not @repeat. So, what's wrong
with SYM_DATA_LOCAL(tr_idt, .fill 2, 5, 0)?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-05-18  2:06       ` Dan Williams
@ 2021-05-18  2:53         ` Kuppuswamy, Sathyanarayanan
  2021-05-18  4:08           ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-18  2:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Sean Christopherson, Kai Huang



On 5/17/21 7:06 PM, Dan Williams wrote:
> I notice that you have [RFC v2-fix 1/1] as the prefix for this patch.
> b4 recently gained support for partial series re-rolls [1], but I
> think you would need to bump the version number [RFC PATCH v3 21/32]
> and maintain the patch numbering. In this case with changes moving
> between patches, and those other patches being squashed any chance of
> automated reconstruction of this series is likely lost.

Ok. I will make sure to bump the version in next partial re-roll.

If I am fixing this patch as per your comments, do I need to bump the
patch version for it as well?

> 
> Just wanted to note that for future reference in case you were hoping
> to avoid resending full series in the future. For now, some more
> comments below:

Thanks.

> 
> [1]: https://lore.kernel.org/tools/20210517161317.teawoh5qovxpmqdc@nitro.local/
> 
> On Mon, May 17, 2021 at 5:54 PM Kuppuswamy Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>>
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> Add a trampoline for booting APs in 64-bit mode via a software handoff
>> with BIOS, and use the new trampoline for the ACPI MP wake protocol used
>> by TDX. You can find MADT MP wake protocol details in ACPI specification
>> r6.4, sec 5.2.12.19.
>>
>> Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
>> mode.  For the GDT pointer, create a new entry as the existing storage
>> for the pointer occupies the zero entry in the GDT itself.
>>
>> Reported-by: Kai Huang <kai.huang@intel.com>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> ---
>>
>> Changes since RFC v2:
>>   * Removed X86_CR0_NE and EFER related changes from this change
> 
> This was only partially done, see below...
> 
>>     and moved it to patch titled "x86/boot: Avoid #VE during
>>     boot for TDX platforms"
>>   * Fixed commit log as per Dan's suggestion.
>>   * Added inline get_trampoline_start_ip() to set start_ip.
> 
> You also added a comment to tr_idt, but didn't mention it here, so I
> went to double check. Please take care to document all changes to the
> patch from the previous review.

Ok. I will make sure the change log is current.

> 
>>
>>   arch/x86/boot/compressed/pgtable.h       |  2 +-
>>   arch/x86/include/asm/realmode.h          | 10 +++++++
>>   arch/x86/kernel/smpboot.c                |  2 +-
>>   arch/x86/realmode/rm/header.S            |  1 +
>>   arch/x86/realmode/rm/trampoline_64.S     | 38 ++++++++++++++++++++++++
>>   arch/x86/realmode/rm/trampoline_common.S |  7 ++++-
>>   6 files changed, 57 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
>> index 6ff7e81b5628..cc9b2529a086 100644
>> --- a/arch/x86/boot/compressed/pgtable.h
>> +++ b/arch/x86/boot/compressed/pgtable.h
>> @@ -6,7 +6,7 @@
>>   #define TRAMPOLINE_32BIT_PGTABLE_OFFSET        0
>>
>>   #define TRAMPOLINE_32BIT_CODE_OFFSET   PAGE_SIZE
>> -#define TRAMPOLINE_32BIT_CODE_SIZE     0x70
>> +#define TRAMPOLINE_32BIT_CODE_SIZE     0x80
>>
>>   #define TRAMPOLINE_32BIT_STACK_END     TRAMPOLINE_32BIT_SIZE
>>
>> diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
>> index 5db5d083c873..3328c8edb200 100644
>> --- a/arch/x86/include/asm/realmode.h
>> +++ b/arch/x86/include/asm/realmode.h
>> @@ -25,6 +25,7 @@ struct real_mode_header {
>>          u32     sev_es_trampoline_start;
>>   #endif
>>   #ifdef CONFIG_X86_64
>> +       u32     trampoline_start64;
>>          u32     trampoline_pgd;
>>   #endif
>>          /* ACPI S3 wakeup */
>> @@ -88,6 +89,15 @@ static inline void set_real_mode_mem(phys_addr_t mem)
>>          real_mode_header = (struct real_mode_header *) __va(mem);
>>   }
>>
>> +static inline unsigned long get_trampoline_start_ip(void)
> 
> I'd prefer this helper take a 'struct real_mode_header *rmh' as an
> argument rather than assume a global variable.

I am fine with it. But the existing inline functions also directly
read/write the real_mode_header. So I just followed the same format.

I will fix this in next version.

> 
>> +{
>> +#ifdef CONFIG_X86_64
>> +        if (is_tdx_guest())
>> +                return real_mode_header->trampoline_start64;
>> +#endif
>> +       return real_mode_header->trampoline_start;
>> +}
>> +
>>   void reserve_real_mode(void);
>>
>>   #endif /* __ASSEMBLY__ */
>> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
>> index 16703c35a944..0b4dff5e67a9 100644
>> --- a/arch/x86/kernel/smpboot.c
>> +++ b/arch/x86/kernel/smpboot.c
>> @@ -1031,7 +1031,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
>>                         int *cpu0_nmi_registered)
>>   {
>>          /* start_ip had better be page-aligned! */
>> -       unsigned long start_ip = real_mode_header->trampoline_start;
>> +       unsigned long start_ip = get_trampoline_start_ip();
>>
>>          unsigned long boot_error = 0;
>>          unsigned long timeout;
>> diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
>> index 8c1db5bf5d78..2eb62be6d256 100644
>> --- a/arch/x86/realmode/rm/header.S
>> +++ b/arch/x86/realmode/rm/header.S
>> @@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
>>          .long   pa_sev_es_trampoline_start
>>   #endif
>>   #ifdef CONFIG_X86_64
>> +       .long   pa_trampoline_start64
>>          .long   pa_trampoline_pgd;
>>   #endif
>>          /* ACPI S3 wakeup */
>> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
>> index 84c5d1b33d10..754f8d2ac9e8 100644
>> --- a/arch/x86/realmode/rm/trampoline_64.S
>> +++ b/arch/x86/realmode/rm/trampoline_64.S
>> @@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
>>          ljmpl   $__KERNEL_CS, $pa_startup_64
>>   SYM_CODE_END(startup_32)
>>
>> +SYM_CODE_START(pa_trampoline_compat)
>> +       /*
>> +        * In compatibility mode.  Prep ESP and DX for startup_32, then disable
>> +        * paging and complete the switch to legacy 32-bit mode.
>> +        */
>> +       movl    $rm_stack_end, %esp
>> +       movw    $__KERNEL_DS, %dx
>> +
>> +       movl    $(X86_CR0_NE | X86_CR0_PE), %eax
> 
> Before this patch the startup path did not touch X86_CR0_NE. I assume
> it was added opportunistically for the TDX case? If it is to stay in
> this patch it deserves a code comment / mention in the changelog, or
> it needs to move to the other patch that fixes up the CR0 setup for
> TDX.

I will move X86_CR0_NE related update to the patch that has other
X86_CR0_NE related updates.

> 
> 
>> +       movl    %eax, %cr0
>> +       ljmpl   $__KERNEL32_CS, $pa_startup_32
>> +SYM_CODE_END(pa_trampoline_compat)
>> +
>>          .section ".text64","ax"
>>          .code64
>>          .balign 4
>> @@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
>>          jmpq    *tr_start(%rip)
>>   SYM_CODE_END(startup_64)
>>
>> +SYM_CODE_START(trampoline_start64)
>> +       /*
>> +        * APs start here on a direct transfer from 64-bit BIOS with identity
>> +        * mapped page tables.  Load the kernel's GDT in order to gear down to
>> +        * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
>> +        * segment registers.  Load the zero IDT so any fault triggers a
>> +        * shutdown instead of jumping back into BIOS.
>> +        */
>> +       lidt    tr_idt(%rip)
>> +       lgdt    tr_gdt64(%rip)
>> +
>> +       ljmpl   *tr_compat(%rip)
>> +SYM_CODE_END(trampoline_start64)
>> +
>>          .section ".rodata","a"
>>          # Duplicate the global descriptor table
>>          # so the kernel can live anywhere
>> @@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
>>          .quad   0x00cf93000000ffff      # __KERNEL_DS
>>   SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
>>
>> +SYM_DATA_START(tr_gdt64)
>> +       .short  tr_gdt_end - tr_gdt - 1 # gdt limit
>> +       .long   pa_tr_gdt
>> +       .long   0
>> +SYM_DATA_END(tr_gdt64)
>> +
>> +SYM_DATA_START(tr_compat)
>> +       .long   pa_trampoline_compat
>> +       .short  __KERNEL32_CS
>> +SYM_DATA_END(tr_compat)
>> +
>>          .bss
>>          .balign PAGE_SIZE
>>   SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
>> diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
>> index 5033e640f957..ade7db208e4e 100644
>> --- a/arch/x86/realmode/rm/trampoline_common.S
>> +++ b/arch/x86/realmode/rm/trampoline_common.S
>> @@ -1,4 +1,9 @@
>>   /* SPDX-License-Identifier: GPL-2.0 */
>>          .section ".rodata","a"
>>          .balign 16
>> -SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
>> +
>> +/* .fill cannot be used for size > 8. So use short and quad */
> 
> If there is to be a comment here it should be to clarify why @tr_idt
> is 10 bytes, not necessarily a quirk of the assembler.

Got it. I will fix the comment or remove it.

> 
>> +SYM_DATA_START_LOCAL(tr_idt)
> 
> The .fill restriction is only for @size, not @repeat. So, what's wrong
> with SYM_DATA_LOCAL(tr_idt, .fill 2, 5, 0)?

Any reason to prefer the above change over the previous code?

SYM_DATA_START_LOCAL(tr_idt)
         .short  0
         .quad   0
SYM_DATA_END(tr_idt)

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-05-18  2:53         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-18  4:08           ` Dan Williams
  2021-05-20  0:18             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-18  4:08 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Sean Christopherson, Kai Huang

On Mon, May 17, 2021 at 7:53 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
>
>
> On 5/17/21 7:06 PM, Dan Williams wrote:
> > I notice that you have [RFC v2-fix 1/1] as the prefix for this patch.
> > b4 recently gained support for partial series re-rolls [1], but I
> > think you would need to bump the version number [RFC PATCH v3 21/32]
> > and maintain the patch numbering. In this case with changes moving
> > between patches, and those other patches being squashed any chance of
> > automated reconstruction of this series is likely lost.
>
> Ok. I will make sure to bump the version in next partial re-roll.
>
> If I am fixing this patch as per your comments, do I need to bump the
> patch version for it as well?

I don't think it matters too much in this case as I don't think I can
use b4 to assemble this series. So just for future reference on other
patch sets. That said, I wouldn't mind a link to your work-in-progress
branch to see all the changes together in one place.

[..]
> > I'd prefer this helper take a 'struct real_mode_header *rmh' as an
> > argument rather than assume a global variable.
>
> I am fine with it. But existing inline functions also directly read/write
> the real_mode_header. So I just followed the same format.

I notice the SEV-ES code passes an @rmh variable around for this purpose.

[..]
> > If there is to be a comment here it should be to clarify why @tr_idt
> > is 10 bytes, not necessarily a quirk of the assembler.
>
> Got it. I will fix the comment or remove it.
>
> >
> >> +SYM_DATA_START_LOCAL(tr_idt)
> >
> > The .fill restriction is only for @size, not @repeat. So, what's wrong
> > with SYM_DATA_LOCAL(tr_idt, .fill 2, 5, 0)?
>
> Any reason to prefer the above change over the previous code?

What I'm really after is capturing why this size needs to be adjusted
for future reference. Maybe it's plainly obvious to someone who has
worked with this code, but it was not immediately obvious to me.

>
> SYM_DATA_START_LOCAL(tr_idt)
>          .short  0
>          .quad   0
> SYM_DATA_END(tr_idt)

This format implies that tr_idt is reserving space for 2 distinct data
structure attributes of those sizes; can you just put those names here
as comments? Otherwise the .fill format is more compact.


* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18  0:48     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-05-18 15:00       ` Dave Hansen
  2021-05-18 15:56         ` Andi Kleen
  2021-05-18 16:18         ` Sean Christopherson
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 15:00 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> In traditional VMs, MMIO tends to be implemented by giving a
> guest access to a mapping which will cause a VMEXIT on access.
> That's not possible in TDX guest.

Why is it not possible?

> So use #VE to implement MMIO support. In TDX guest, MMIO triggers #VE
> with EPT_VIOLATION exit reason.

What does the #VE handler do to resolve the exception?

> For now we only handle a subset of instructions that the kernel
> uses for MMIO operations. User-space access triggers SIGBUS.

How do you know which instructions the kernel uses?  How do you know
that the compiler won't change them?

I guess the kernel won't boot far if this happens, but this still sounds
like trial-and-error programming.

> Also, reasons for supporting #VE based MMIO in TDX guest are,
> 
> * MMIO is widely used and we'll have more drivers in the future.

OK, but you've also made a big deal about having to go explicitly audit
these drivers.  I would imagine converting these over to stop using MMIO
would be _relatively_ minor compared to a big security audit and new
fuzzing infrastructure.

> * We don't want to annotate every TDX specific MMIO readl/writel etc.

				    ^ TDX-specific

> * If we didn't annotate we would need to add an alternative to every
>   MMIO access in the kernel (even though 99.9% will never be used on
>   TDX) which would be a complete waste and incredible binary bloat
>   for nothing.

That sounds like something objective we can measure.  Does this cost 1
byte of extra text per readl/writel?  10?  100?

You're also being rather indirect about what solutions you ruled out.
Why not just say: we considered doing ____, but ruled that out because
it would have required ____.  Above you just tell us what the solution
required without mentioning the solution.

> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index b9e3010987e0..9330c7a9ad69 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -5,6 +5,8 @@
>  
>  #include <asm/tdx.h>
>  #include <asm/vmx.h>
> +#include <asm/insn.h>
> +#include <linux/sched/signal.h> /* force_sig_fault() */
>  
>  #include <linux/cpu.h>
>  #include <linux/protected_guest.h>
> @@ -209,6 +211,101 @@ static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
>  	}
>  }
>  
> +static unsigned long tdg_mmio(int size, bool write, unsigned long addr,
> +		unsigned long val)
> +{
> +	return tdx_hypercall_out_r11(EXIT_REASON_EPT_VIOLATION, size,
> +				     write, addr, val);
> +}
> +
> +static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn)
> +{
> +	static const int regoff[] = {
> +		offsetof(struct pt_regs, ax),
> +		offsetof(struct pt_regs, cx),
> +		offsetof(struct pt_regs, dx),
> +		offsetof(struct pt_regs, bx),
> +		offsetof(struct pt_regs, sp),
> +		offsetof(struct pt_regs, bp),
> +		offsetof(struct pt_regs, si),
> +		offsetof(struct pt_regs, di),
> +		offsetof(struct pt_regs, r8),
> +		offsetof(struct pt_regs, r9),
> +		offsetof(struct pt_regs, r10),
> +		offsetof(struct pt_regs, r11),
> +		offsetof(struct pt_regs, r12),
> +		offsetof(struct pt_regs, r13),
> +		offsetof(struct pt_regs, r14),
> +		offsetof(struct pt_regs, r15),
> +	};
> +	int regno;
> +
> +	regno = X86_MODRM_REG(insn->modrm.value);
> +	if (X86_REX_R(insn->rex_prefix.value))
> +		regno += 8;
> +
> +	return (void *)regs + regoff[regno];
> +}

Was there a reason you copied and pasted this from get_reg_offset()
instead of refactoring?  This looks like almost entirely a subset of
get_reg_offset().

> +static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
> +	int size;
> +	bool write;
> +	unsigned long *reg;
> +	struct insn insn;
> +	unsigned long val = 0;
> +
> +	/*
> +	 * User mode would mean the kernel exposed a device directly
> +	 * to ring3, which shouldn't happen except for things like
> +	 * DPDK.
> +	 */

Uhh....

	https://www.kernel.org/doc/html/v4.14/driver-api/uio-howto.html

I thought there were more than a few ways that userspace could get
access to MMIO mappings.

Also, do most people know what DPDK is?  Should we even be talking about
silly out-of-tree kernel bypass schemes in kernel comments?

> +	if (user_mode(regs)) {
> +		pr_err("Unexpected user-mode MMIO access.\n");
> +		force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);

						       extra space ^

Is a non-ratelimited pr_err() appropriate here?  I guess there shouldn't
be any MMIO passthrough to userspace on these systems.

> +		return 0;
> +	}
> +
> +	kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
> +	insn_get_length(&insn);
> +	insn_get_opcode(&insn);
> +
> +	write = ve->exit_qual & 0x2;
> +
> +	size = insn.opnd_bytes;
> +	switch (insn.opcode.bytes[0]) {
> +	/* MOV r/m8	r8	*/
> +	case 0x88:
> +	/* MOV r8	r/m8	*/
> +	case 0x8A:
> +	/* MOV r/m8	imm8	*/
> +	case 0xC6:

FWIW, I find that *REALLY* hard to read.

Check out is_string_insn() for a more readable example.

Oh, and I misread that.  I read it as "these are all the opcodes we care
about".  When, in fact, I _think_ it's all the opcodes that don't have a
size in insn.opnd_bytes.

Could you spell that out, please?

> +		size = 1;
> +		break;
> +	}
> +
> +	if (inat_has_immediate(insn.attr)) {
> +		BUG_ON(!write);
> +		val = insn.immediate.value;

This is pretty interesting.  This won't work with implicit accesses.  I
guess the limited opcodes above limit how much imprecision will result.
 But, it would still be nice to hear something about that.

For instance, if someone pointed a mid-level page table to MMIO, we'd
get a va->gpa that had zero to do with the instruction.  Granted, that's
only going to happen if something bonkers is going on, but maybe I'm
missing some simpler cases of implicit accesses.

> +		tdg_mmio(size, write, ve->gpa, val);

What happens if this is an MMIO operation that *partially* touches MMIO
and partially touches normal memory?  Let's say I wrote two bytes
(0x1234), starting at the last byte of a RAM page that ran over into an
MMIO page.  The fault would occur trying to write 0x34 to the MMIO, but
the instruction cracking would result in trying to write 0x1234 into the
MMIO.

It doesn't seem *that* outlandish that an MMIO might cross a page
boundary.  Would this work for a two-byte MMIO that crosses a page?

> +		return insn.length;
> +	}
> +
> +	BUG_ON(!inat_has_modrm(insn.attr));

A comment would be nice here about the BUG_ON().

It would also be nice to give a high-level view of what's going on and
what we know about the instruction at this point.

> +	reg = get_reg_ptr(regs, &insn);
> +
> +	if (write) {
> +		memcpy(&val, reg, size);
> +		tdg_mmio(size, write, ve->gpa, val);
> +	} else {
> +		val = tdg_mmio(size, write, ve->gpa, val);
> +		memset(reg, 0, size);
> +		memcpy(reg, &val, size);
> +	}
> +	return insn.length;
> +}
> +
>  unsigned long tdg_get_ve_info(struct ve_info *ve)
>  {
>  	u64 ret;
> @@ -258,6 +355,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>  	case EXIT_REASON_IO_INSTRUCTION:
>  		tdg_handle_io(regs, ve->exit_qual);
>  		break;
> +	case EXIT_REASON_EPT_VIOLATION:
> +		ve->instr_len = tdg_handle_mmio(regs, ve);
> +		break;
>  	default:
>  		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>  		return -EFAULT;
> 



* Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-18  0:09     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-05-18 15:11       ` Dave Hansen
  2021-05-18 15:45         ` Andi Kleen
  2021-05-21 18:45       ` [RFC v2-fix " Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 15:11 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson

On 5/17/21 5:09 PM, Kuppuswamy Sathyanarayanan wrote:
> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
> although we don't expect it to happen because we don't expect NMIs to
> trigger #VEs. Another case where they could happen is if the #VE
> exception panics, but in this case there are no guarantees on anything
> anyways.

This implies: "we do not expect any NMI to do MMIO".  Is that true?  Why?


* Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-18 15:11       ` Dave Hansen
@ 2021-05-18 15:45         ` Andi Kleen
  2021-05-18 15:56           ` Dave Hansen
  2021-05-21 19:22           ` Dan Williams
  0 siblings, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 15:45 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson


On 5/18/2021 8:11 AM, Dave Hansen wrote:
> On 5/17/21 5:09 PM, Kuppuswamy Sathyanarayanan wrote:
>> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
>> although we don't expect it to happen because we don't expect NMIs to
>> trigger #VEs. Another case where they could happen is if the #VE
>> exception panics, but in this case there are no guarantees on anything
>> anyways.
> This implies: "we do not expect any NMI to do MMIO".  Is that true?  Why?

Only drivers that are not supported in TDX anyway could do it (mainly
watchdog drivers).

panic is an exception, but that has already been covered.

-Andi





* Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18  0:15                 ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-05-18 15:51                   ` Dave Hansen
  2021-05-18 16:23                     ` Sean Christopherson
  2021-05-18 20:12                     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 15:51 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Isaku Yamahata

Question for KVM folks: Should all of these guest patches say:
"x86/tdx/guest:" or something?  It seems like that would put us all in
the right frame of mind as we review these.  It's kinda easy (for me at
least) to get lost about which side I'm looking at sometimes.

On 5/17/21 5:15 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> KVM hypercalls use the "vmcall" or "vmmcall" instructions.
> Although the ABI is similar, those instructions no longer
> function for TDX guests. Make vendor specififc TDVMCALLs

				"vendor-specific"

		    Hyphen and spelling ^

> instead of VMCALL.

This would also be a great place to say:

This enables TDX guests to run with KVM acting as the hypervisor.  TDX
guests running under other hypervisors will continue to use those
hypervisors' hypercalls.

> [Isaku: proposed KVM VENDOR string]
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

This SoB chain is odd.  Kirill wrote this, sent it to Isaku, who sent it
to Sathya?

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9e0e0ff76bab..768df1b98487 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -886,6 +886,12 @@ config INTEL_TDX_GUEST
>  	  run in a CPU mode that protects the confidentiality of TD memory
>  	  contents and the TD’s CPU state from other software, including VMM.
>  
> +config INTEL_TDX_GUEST_KVM
> +	def_bool y
> +	depends on KVM_GUEST && INTEL_TDX_GUEST
> +	help
> +	 This option enables KVM specific hypercalls in TDX guest.

For something that's not user-visible, I'd probably just add a Kconfig
comment rather than help text.

...
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 7966c10ea8d1..a90fec004844 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -128,6 +128,7 @@ obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
>  
>  obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
>  obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
> +obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o

Is the indentation consistent with the other items near "tdx-kvm.o" in
the Makefile?

...
> +/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
> +long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
> +		unsigned long p3, unsigned long p4)
> +{
> +	return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
> +}
> +EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);

I always forget that KVM code is goofy and needs to have things in C
files so you can export the symbols.  Could you add a sentence to the
changelog to this effect?

Code-wise, this is fine.  Just a few tweaks and I'll be happy to ack
this one.


* Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-18 15:45         ` Andi Kleen
@ 2021-05-18 15:56           ` Dave Hansen
  2021-05-18 16:00             ` Andi Kleen
  2021-05-21 19:22           ` Dan Williams
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 15:56 UTC (permalink / raw)
  To: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson

On 5/18/21 8:45 AM, Andi Kleen wrote:
> 
> On 5/18/2021 8:11 AM, Dave Hansen wrote:
>> On 5/17/21 5:09 PM, Kuppuswamy Sathyanarayanan wrote:
>>> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
>>> although we don't expect it to happen because we don't expect NMIs to
>>> trigger #VEs. Another case where they could happen is if the #VE
>>> exception panics, but in this case there are no guarantees on anything
>>> anyways.
>> This implies: "we do not expect any NMI to do MMIO".  Is that true?  Why?
> 
> Only drivers that are not supported in TDX anyway could do it (mainly
> watchdog drivers).

No APIC access either?

Also, shouldn't we have at least a:

	WARN_ON_ONCE(in_nmi());

if we don't expect (or handle well) #VE in NMIs?


* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 15:00       ` Dave Hansen
@ 2021-05-18 15:56         ` Andi Kleen
  2021-05-18 16:04           ` Dave Hansen
  2021-05-18 17:11           ` Sean Christopherson
  2021-05-18 16:18         ` Sean Christopherson
  1 sibling, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 15:56 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel


On 5/18/2021 8:00 AM, Dave Hansen wrote:
> On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> In traditional VMs, MMIO tends to be implemented by giving a
>> guest access to a mapping which will cause a VMEXIT on access.
>> That's not possible in TDX guest.
> Why is it not possible?

For one, the TDX module doesn't support uncached mappings (IgnorePAT is
always 1).




>
>> For now we only handle a subset of instructions that the kernel
>> uses for MMIO operations. User-space access triggers SIGBUS.
> How do you know which instructions the kernel uses?

They're all in MMIO macros.


>   How do you know
> that the compiler won't change them?

The macros try hard to prevent that because it would likely break real 
MMIO too.

Besides it works for others, like AMD-SEV today and of course all the 
hypervisors that do the same.




> That sounds like something objective we can measure.  Does this cost 1
> byte of extra text per readl/writel?  10?  100?

An alternative costs at least a pointer, plus the extra alternative 
code. It's definitely more than 10 bytes; I would guess 40+.



>
> I thought there were more than a few ways that userspace could get
> access to MMIO mappings.

Yes and they will all fault in TDX guests.


>> +	if (user_mode(regs)) {
>> +		pr_err("Unexpected user-mode MMIO access.\n");
>> +		force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
> 						       extra space ^
>
> Is a non-ratelimited pr_err() appropriate here?  I guess there shouldn't
> be any MMIO passthrough to userspace on these systems.
Yes rate limiting makes sense.



* Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-18 15:56           ` Dave Hansen
@ 2021-05-18 16:00             ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 16:00 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson


> No APIC access either?


It's all X2APIC inside TDX, which uses MSRs.

>
> Also, shouldn't we have at least a:
>
> 	WARN_ON_ONCE(in_nmi());
>
> if we don't expect (or handle well) #VE in NMIs?

We handle it perfectly fine. It's just not needed.


-Andi



* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 15:56         ` Andi Kleen
@ 2021-05-18 16:04           ` Dave Hansen
  2021-05-18 16:10             ` Andi Kleen
  2021-05-18 17:11           ` Sean Christopherson
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 16:04 UTC (permalink / raw)
  To: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 5/18/21 8:56 AM, Andi Kleen wrote:
> On 5/18/2021 8:00 AM, Dave Hansen wrote:
>> On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>
>>> In traditional VMs, MMIO tends to be implemented by giving a
>>> guest access to a mapping which will cause a VMEXIT on access.
>>> That's not possible in TDX guest.
>> Why is it not possible?
> 
> For one, the TDX module doesn't support uncached mappings (IgnorePAT is
> always 1).

Actually, I was thinking more along the lines of why the architecture
doesn't have VMEXITs:  VMEXITs expose guest state to the host and VMMs
use that state to emulate MMIO.  TDX guests don't trust the host and
can't have that arbitrary state exposed to the host.  So, they sanitize
the state in the #VE handler and make a *controlled* transition into the
host with a TDCALL rather than an uncontrolled VMEXIT.

>>> For now we only handle a subset of instructions that the kernel
>>> uses for MMIO operations. User-space access triggers SIGBUS.
>> How do you know which instructions the kernel uses?
> 
> They're all in MMIO macros.

I've heard exactly the opposite from the TDX team in the past.  What I
remember was a claim that one can not just leverage the MMIO macros as a
single point to avoid MMIO.  I remember being told that not all code in
the kernel that does MMIO uses these macros.  APIC MMIO's were called
out as a place that does not use the MMIO macros.

I'm confused now.

>>   How do you know that the compiler won't change them?
> 
> The macros try hard to prevent that because it would likely break real
> MMIO too.
> 
> Besides it works for others, like AMD-SEV today and of course all the
> hypervisors that do the same.

That would be some excellent information for the changelog.


* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 16:04           ` Dave Hansen
@ 2021-05-18 16:10             ` Andi Kleen
  2021-05-18 16:22               ` Dave Hansen
  2021-05-18 17:28               ` Andi Kleen
  0 siblings, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 16:10 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel


>>>> For now we only handle a subset of instructions that the kernel
>>>> uses for MMIO operations. User-space access triggers SIGBUS.
>>> How do you know which instructions the kernel uses?
>> They're all in MMIO macros.
> I've heard exactly the opposite from the TDX team in the past.  What I
> remember was a claim that one can not just leverage the MMIO macros as a
> single point to avoid MMIO.  I remember being told that not all code in
> the kernel that does MMIO uses these macros.  APIC MMIO's were called
> out as a place that does not use the MMIO macros.

Yes, the x86 APIC has its own macros, but we don't use the MMIO-based 
APIC, only X2APIC in TDX.

I'm not aware of any other places that would do MMIO without using the 
standard io.h macros, although it might happen in theory on x86 (but 
would likely break on some other architectures)


-Andi




* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 15:00       ` Dave Hansen
  2021-05-18 15:56         ` Andi Kleen
@ 2021-05-18 16:18         ` Sean Christopherson
  2021-05-18 17:15           ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 16:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Tue, May 18, 2021, Dave Hansen wrote:
> On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > In traditional VMs, MMIO tends to be implemented by giving a
> > guest access to a mapping which will cause a VMEXIT on access.
> > That's not possible in TDX guest.
> 
> Why is it not possible?

It is possible, and in fact KVM will cause a VM-Exit on the first access to a
given MMIO page.  The problem is that guest state is inaccessible and so the VMM
cannot do the front end of MMIO instruction emulation.

> > So use #VE to implement MMIO support. In TDX guest, MMIO triggers #VE
> > with EPT_VIOLATION exit reason.

It's more accurate to say that the VMM will configure EPT entries for pages that
require instruction emulation to cause #VE.

> What does the #VE handler do to resolve the exception?
> 
> > For now we only handle a subset of instructions that the kernel
> > uses for MMIO operations. User-space access triggers SIGBUS.
> 
> How do you know which instructions the kernel uses?  How do you know
> that the compiler won't change them?
>
> I guess the kernel won't boot far if this happens, but this still sounds
> like trial-and-error programming.

If a driver accesses MMIO through a struct overlay, all bets are off.  The I/O
APIC code does this, but that problem is "solved" by forcefully disabling the
I/O APIC since it's useless for the current incarnation of TDX.  IIRC, some of
the console code also accesses MMIO via a struct (or maybe just through generic
C code), and the compiler does indeed employ a wider variety of instructions.

So yeah, whack-a-mole.
 
> > Also, reasons for supporting #VE based MMIO in TDX guest are,
> > 
> > * MMIO is widely used and we'll have more drivers in the future.
> 
> OK, but you've also made a big deal about having to go explicitly audit
> these drivers.  I would imagine converting these over to stop using MMIO
> would be _relatively_ minor compared 

For drivers that use the kernel's macros, converting them to use TDVMCALL
directly will be trivial and shouldn't even require any modifications to the
driver.  For drivers that use a struct overlay or generic C code, the "conversion"
could require a complete rewrite of the driver.

> to a big security audit and new fuzzing infrastructure.
> 
> > * We don't want to annotate every TDX specific MMIO readl/writel etc.
> 
> 				    ^ TDX-specific
> 
> > * If we didn't annotate we would need to add an alternative to every
> >   MMIO access in the kernel (even though 99.9% will never be used on
> >   TDX) which would be a complete waste and incredible binary bloat
> >   for nothing.
> 
> That sounds like something objective we can measure.  Does this cost 1
> byte of extra text per readl/writel?  10?  100?

Agreed.  And IMO, it's worth converting the common case (macros) if the overhead
is acceptable, while leaving the #VE handling in place for non-standard code.

> > +static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 
...

> > +		return 0;
> > +	}
> > +
> > +	kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
> > +	insn_get_length(&insn);
> > +	insn_get_opcode(&insn);
> > +
> > +	write = ve->exit_qual & 0x2;
> > +
> > +	size = insn.opnd_bytes;
> > +	switch (insn.opcode.bytes[0]) {
> > +	/* MOV r/m8	r8	*/
> > +	case 0x88:
> > +	/* MOV r8	r/m8	*/
> > +	case 0x8A:
> > +	/* MOV r/m8	imm8	*/
> > +	case 0xC6:
> 
> FWIW, I find that *REALLY* hard to read.

Why does this code exist at all?  TDX and SEV-ES absolutely must share code for
handling MMIO reflection.  It will require a fair amount of refactoring to move
the guts of vc_handle_mmio() to common code, but there is zero reason to maintain
two separate versions of the opcode cracking.

Ditto for string I/O in vc_handle_ioio().

> What happens if this is an MMIO operation that *partially* touches MMIO
> and partially touches normal memory?  Let's say I wrote two bytes
> (0x1234), starting at the last byte of a RAM page that ran over into an
> MMIO page.  The fault would occur trying to write 0x34 to the MMIO, but
> the instruction cracking would result in trying to write 0x1234 into the
> MMIO.
> 
> It doesn't seem *that* outlandish that an MMIO might cross a page
> boundary.  Would this work for a two-byte MMIO that crosses a page?

I'm pretty sure we can get away with panic (kernel) and SIGBUS (userspace) on
a reflected memory access that splits a page.  Yes, it's theoretically possible
and probably even "works", but practically speaking no emulated MMIO device is
going to have a single logical register/thing split a page, and I can't think of
any reason to allow accessing multiple registers/things across a page split.

The existing SEV-ES #VC handlers appear to be missing page split checks, so that
needs to be fixed.
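[Editor's sketch, not part of the original mail: the page-split guard being proposed here reduces to a one-line check. The helper name is made up for illustration.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE	4096u
#define PAGE_MASK	(~(uint64_t)(PAGE_SIZE - 1))

/*
 * Sketch of the guard suggested above: reject any emulated MMIO access
 * whose bytes do not all fall within one 4K page.  Instead of emulating
 * a split access, the handler would panic (kernel) or raise SIGBUS
 * (userspace).
 */
static bool mmio_access_splits_page(uint64_t addr, unsigned int size)
{
	return (addr & PAGE_MASK) != ((addr + size - 1) & PAGE_MASK);
}
```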

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 16:10             ` Andi Kleen
@ 2021-05-18 16:22               ` Dave Hansen
  2021-05-18 17:05                 ` Andi Kleen
  2021-05-18 17:28               ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 16:22 UTC (permalink / raw)
  To: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 5/18/21 9:10 AM, Andi Kleen wrote:
> I'm not aware of any other places that would do MMIO without using the
> standard io.h macros, although it might happen in theory on x86 (but
> would likely break on some other architectures)

Can we please connect all of the dots and turn this into a coherent
changelog?

 * In-kernel MMIO is handled via exceptions (#VE) and instruction
   cracking
 * Arbitrary MMIO instructions are not handled (and would result in...)
 * The limited set of MMIO instructions that are handled are known and
   come from the io.h macros, ultimately build_mmio_read/write().
 * This approach is also used for SEV-ES???
 * Some x86 code that avoids the MMIO code is known to exist (APIC).
   But, this code is not used in TDX guests
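[Editor's sketch, not part of the original mail: a simplified, plain-C model of the build_mmio_read/write() pattern from arch/x86/include/asm/io.h (paraphrased; the real macros emit a single volatile MOV via inline asm). The point of the bullet above is that every readl()/writel() ultimately comes from one macro, so the instruction set the #VE handler must crack is small and known.]

```c
#include <assert.h>
#include <stdint.h>

/* Model of build_mmio_read(): one volatile load per accessor. */
#define BUILD_MMIO_READ(name, type)			\
static type name(const volatile void *addr)		\
{							\
	return *(const volatile type *)addr;		\
}

/* Model of build_mmio_write(): one volatile store per accessor. */
#define BUILD_MMIO_WRITE(name, type)			\
static void name(type val, volatile void *addr)		\
{							\
	*(volatile type *)addr = val;			\
}

BUILD_MMIO_READ(readl_model, uint32_t)
BUILD_MMIO_WRITE(writel_model, uint32_t)
```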

BTW, in perusing arch/x86/include/asm/io.h, I was reminded of movdir64b.
 That seems like one we'd want to take care of sooner rather than later.
 Or, do we expect the first folks who expose a movdir64b-using driver to
TDX to go and update this code?

Also, the sev_key_active() stuff in there makes me nervous.  Does this
scheme work with these:

> static inline void outs##bwl(int port, const void *addr, unsigned long count) \
> static inline void ins##bwl(int port, void *addr, unsigned long count)  \

?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 15:51                   ` Dave Hansen
@ 2021-05-18 16:23                     ` Sean Christopherson
  2021-05-18 20:12                     ` Kuppuswamy, Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 16:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel, Isaku Yamahata

On Tue, May 18, 2021, Dave Hansen wrote:
> Question for KVM folks: Should all of these guest patches say:
> "x86/tdx/guest:" or something?

x86/tdx is fine.  The KVM convention is to use "KVM: xxx:" for KVM host code and
"x86/kvm" for KVM guest code.  E.g. for KVM TDX host code, the subjects will be
"KVM: x86:", "KVM: VMX:" or "KVM: TDX:".

The one I really don't like is using "tdg_" as the acronym for guest functions.
I find it really confusing and grep-unfriendly.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 16:22               ` Dave Hansen
@ 2021-05-18 17:05                 ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 17:05 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel


>   Or, do we expect the first folks who expose a movdir64b-using driver to
> TDX to go and update this code?

That's what we want to do.


>
> Also, the sev_key_active() stuff in there makes me nervous.  Does this
> scheme work with these:
>
>> static inline void outs##bwl(int port, const void *addr, unsigned long count) \
>> static inline void ins##bwl(int port, void *addr, unsigned long count)  \
> ?


This is not MMIO, but port IO. We do similar changes as AMD for TDX.


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 15:56         ` Andi Kleen
  2021-05-18 16:04           ` Dave Hansen
@ 2021-05-18 17:11           ` Sean Christopherson
  2021-05-18 17:21             ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 17:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Tue, May 18, 2021, Andi Kleen wrote:
> On 5/18/2021 8:00 AM, Dave Hansen wrote:
> > That sounds like something objective we can measure.  Does this cost 1
> > byte of extra text per readl/writel?  10?  100?
> 
> Alternatives are at least a pointer, but also the extra alternative code.
> It's definitely more than 10, I would guess 40+

The extra bytes for .altinstructions is very different than the extra bytes for
the code itself.  The .altinstructions section is freed after init, so yes it
bloats the kernel size a bit, but the runtime footprint is unaffected by the
patching metadata.

IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.

The other option to explore is to hook/patch IO_COND(), which can be done with
neglible overhead because the helpers that use IO_COND() are not inlined.  In a
TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
And if there are TDX VMMs that want to deploy virtio-mmio, hooking
drivers/virtio/virtio_mmio.c directly would be a viable option.
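[Editor's sketch, not part of the original mail: a simplified model of the lib/iomap.c IO_COND() dispatch being discussed (paraphrased, not the exact kernel macro). ioport_map() encodes port numbers as small cookie "addresses", so helpers like ioread32() route one call site to either port I/O or MMIO; that single choke point is what could be redirected for TDX instead of patching every readl().]

```c
#include <assert.h>

#define PIO_OFFSET	0x10000UL
#define PIO_MASK	0x0ffffUL
#define PIO_RESERVED	0x40000UL

enum io_kind { IO_BAD, IO_PIO, IO_MMIO };

/* Model of IO_COND(): classify an ioread/iowrite cookie address. */
static enum io_kind classify_iomap_addr(unsigned long addr)
{
	if (addr >= PIO_RESERVED)
		return IO_MMIO;		/* real MMIO mapping */
	if (addr > PIO_OFFSET)
		return IO_PIO;		/* port cookie from ioport_map() */
	return IO_BAD;			/* neither: bad_io_access() */
}
```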

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 16:18         ` Sean Christopherson
@ 2021-05-18 17:15           ` Andi Kleen
  2021-05-18 18:17             ` Sean Christopherson
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 17:15 UTC (permalink / raw)
  To: Sean Christopherson, Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel


>>> * If we didn't annotate we would need to add an alternative to every
>>>    MMIO access in the kernel (even though 99.9% will never be used on
>>>    TDX) which would be a complete waste and incredible binary bloat
>>>    for nothing.
>> That sounds like something objective we can measure.  Does this cost 1
>> byte of extra text per readl/writel?  10?  100?
> Agreed.  And IMO, it's worth converting the common case (macros) if the overhead
> is acceptable, while leaving the #VE handling in place for non-standard code.

We have many millions of lines of MMIO-using driver code in the kernel, 
99.99% of which never runs in TDX. I don't see any point in impacting 
everything for this. That would go against all good code-change hygiene 
practices, and would also just be bloat.

But we also don't want to touch every driver, for similar reasons.

What I think would make sense is to convert something to a direct TDCALL 
if we figure out the extra #VE is a real-life performance problem. AFAIK 
the only candidate I have in mind for this is the virtio doorbell 
write (and potentially later its VMBus equivalent). But we should really 
only do that if measurements show it's needed.



> Why does this code exist at all?  TDX and SEV-ES absolutely must share code for
> handling MMIO reflection.  It will require a fair amount of refactoring to move
> the guts of vc_handle_mmio() to common code, but there is zero reason to maintain
> two separate versions of the opcode cracking.

While that's true on the high level, all the low level details are 
different. We looked at unifying at some point, but it would have been a 
callback hell. I don't think unifying would make anything cleaner.

Besides the bulk of the decoding work is already unified in the common 
x86 instruction decoder. The actual actions are different, and the code 
fetching is also different, so on the rest there isn't that much to unify.


> The existing SEV-ES #VC handlers appear to be missing page split checks, so that
> needs to be fixed.

Only if anyone in the kernel actually relies on it?


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 17:11           ` Sean Christopherson
@ 2021-05-18 17:21             ` Andi Kleen
  2021-05-18 17:46               ` Dave Hansen
  2021-05-18 18:22               ` Sean Christopherson
  0 siblings, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 17:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel


> The extra bytes for .altinstructions is very different than the extra bytes for
> the code itself.  The .altinstructions section is freed after init, so yes it
> bloats the kernel size a bit, but the runtime footprint is unaffected by the
> patching metadata.
>
> IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.
>
> The other option to explore is to hook/patch IO_COND(), which can be done with
> neglible overhead because the helpers that use IO_COND() are not inlined.  In a
> TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
> majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
> And if there are TDX VMMs that want to deploy virtio-mmio, hooking
> drivers/virtio/virtio_mmio.c directly would be a viable option.

Yes but what's the point of all that?

Even if it's only 3 bytes we still have a lot of MMIO all over the 
kernel which never needs it.

And I don't even see what TDX (or SEV which already does the decoding 
and has been merged) would get out of it. We handle all the #VEs just 
fine. And the instruction handling code is fairly straight forward too.

Besides instruction decoding works fine for all the existing 
hypervisors. All we really want to do is to do the same thing as KVM 
would do.

-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 16:10             ` Andi Kleen
  2021-05-18 16:22               ` Dave Hansen
@ 2021-05-18 17:28               ` Andi Kleen
  1 sibling, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 17:28 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel


On 5/18/2021 9:10 AM, Andi Kleen wrote:
>
>>>>> For now we only handle a subset of instructions that the kernel
>>>>> uses for MMIO operations. User-space access triggers SIGBUS.
>>>> How do you know which instructions the kernel uses?
>>> They're all in MMIO macros.
>> I've heard exactly the opposite from the TDX team in the past. What I
>> remember was a claim that one can not just leverage the MMIO macros as a
>> single point to avoid MMIO.  I remember being told that not all code in
>> the kernel that does MMIO uses these macros.  APIC MMIO's were called
>> out as a place that does not use the MMIO macros.
>
> Yes x86 APIC has its own macros, but we don't use the MMIO based APIC, 
> only X2APIC in TDX.

I must correct myself here. We actually use #VE to handle MSRs, or at 
least those that are not context-switched by the TDX module. So there 
can be a #VE nested in an NMI in normal operation, since MSR accesses 
in NMI context can happen.

I don't think it needs any changes to the code -- this should all work 
-- but we need to update the commit log to document this case.


-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 17:21             ` Andi Kleen
@ 2021-05-18 17:46               ` Dave Hansen
  2021-05-18 18:36                 ` Sean Christopherson
  2021-05-18 20:20                 ` Andi Kleen
  2021-05-18 18:22               ` Sean Christopherson
  1 sibling, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 17:46 UTC (permalink / raw)
  To: Andi Kleen, Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

On 5/18/21 10:21 AM, Andi Kleen wrote:
> Besides instruction decoding works fine for all the existing
> hypervisors. All we really want to do is to do the same thing as KVM
> would do.

Dumb question of the day: If you want to do the same thing that KVM
does, why don't you share more code with KVM?  Wouldn't you, for
instance, need to crack the same instruction opcodes?

I'd feel a lot better about this if you said:

	Listen, this doesn't work for everything.  But, it will run
	every single driver as a TDX guest that KVM can handle as a
	host.  So, if the TDX code is broken, so is the KVM host code.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 17:15           ` Andi Kleen
@ 2021-05-18 18:17             ` Sean Christopherson
  2021-05-20 22:47               ` Kirill A. Shutemov
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 18:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Tue, May 18, 2021, Andi Kleen wrote:
> > Why does this code exist at all?  TDX and SEV-ES absolutely must share code for
> > handling MMIO reflection.  It will require a fair amount of refactoring to move
> > the guts of vc_handle_mmio() to common code, but there is zero reason to maintain
> > two separate versions of the opcode cracking.
> 
> While that's true on the high level, all the low level details are
> different. We looked at unifying at some point, but it would have been a
> callback hell. I don't think unifying would make anything cleaner.

How hard did you look?  The only part that _must_ be different between SEV and
TDX is the hypercall itself, which is wholly contained at the very end of
vc_do_mmio().

Despite vc_slow_virt_to_phys() taking a pointer to the ghcb, it's unused and
thus the function is 100% generic.

The ghcb->shared_buffer usage throughout the upper levels can be eliminated by
refactoring the stack to take a "u64 *val", since MMIO accesses are currently
bounded to 8 bytes.

> Besides the bulk of the decoding work is already unified in the common x86
> instruction decoder. The actual actions are different, and the code fetching
> is also different 

Huh?  What do you mean by "actual actions"?  Why is the code fetch different?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 17:21             ` Andi Kleen
  2021-05-18 17:46               ` Dave Hansen
@ 2021-05-18 18:22               ` Sean Christopherson
  2021-05-18 20:28                 ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 18:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Tue, May 18, 2021, Andi Kleen wrote:
> 
> > The extra bytes for .altinstructions is very different than the extra bytes for
> > the code itself.  The .altinstructions section is freed after init, so yes it
> > bloats the kernel size a bit, but the runtime footprint is unaffected by the
> > patching metadata.
> > 
> > IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.
> > 
> > The other option to explore is to hook/patch IO_COND(), which can be done with
> > neglible overhead because the helpers that use IO_COND() are not inlined.  In a
> > TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
> > majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
> > And if there are TDX VMMs that want to deploy virtio-mmio, hooking
> > drivers/virtio/virtio_mmio.c directly would be a viable option.
> 
> Yes but what's the point of all that?

Patching IO_COND() is relatively low effort.  With some clever refactoring, I
suspect the net lines of code added would be less than 10.  That seems like a
worthwhile effort to avoid millions of faults over the lifetime of the guest.

> Even if it's only 3 bytes we still have a lot of MMIO all over the kernel
> which never needs it.
> 
> And I don't even see what TDX (or SEV which already does the decoding and
> has been merged) would get out of it. We handle all the #VEs just fine. And
> the instruction handling code is fairly straight forward too.
> 
> Besides instruction decoding works fine for all the existing hypervisors.
> All we really want to do is to do the same thing as KVM would do.

Heh, trust me, you don't want to do the same thing KVM does :-)

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 17:46               ` Dave Hansen
@ 2021-05-18 18:36                 ` Sean Christopherson
  2021-05-18 20:20                 ` Andi Kleen
  1 sibling, 0 replies; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 18:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Tue, May 18, 2021, Dave Hansen wrote:
> On 5/18/21 10:21 AM, Andi Kleen wrote:
> > Besides instruction decoding works fine for all the existing
> > hypervisors. All we really want to do is to do the same thing as KVM
> > would do.
> 
> Dumb question of the day: If you want to do the same thing that KVM
> does, why don't you share more code with KVM?  Wouldn't you, for
> instance, need to crack the same instruction opcodes?

Pulling in all of KVM's emulator is a bad idea from a security perspective.  That
could be mitigated to some extent by teaching the emulator to emulate only select
instructions, but it'd still be much higher risk than a barebones guest-specific
implementation.  Because old Intel CPUs don't support unrestricted guest, the set
of instructions that KVM _can_ emulate in total is far, far larger than what is
needed for MMIO.

Allowed instructions aside, KVM needs to handle a large number things a TDX/SEV
guest does not, e.g. segmentation, CPUID model, A/D bit updates, and so on and
so forth.

Refactoring KVM's emulator would also be a monumental task.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Make DMA pages shared
  2021-05-18  1:19   ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-05-18 19:55     ` Sean Christopherson
  2021-05-18 22:12       ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 19:55 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel, Kai Huang,
	Sean Christopherson

On Mon, May 17, 2021, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Intel TDX doesn't allow VMM to access guest memory. Any memory
                                             ^
                                             |- private

And to be pedantic, the VMM can _access_ guest private memory all it wants, it
just can't decrypt guest private memory.

> that is required for communication with VMM must be shared
> explicitly by setting the bit in page table entry. And, after
> setting the shared bit, the conversion must be completed with
> MapGPA TDVMALL. The call informs VMM about the conversion and
> makes it remove the GPA from the S-EPT mapping.

The VMM is _not_ required to remove the GPA from the S-EPT.  E.g. if the VMM
wants to, it can leave a 2mb private page intact and create a 4kb shared page
translation within the same range (ignoring the shared bit).

> The shared memory is similar to unencrypted memory in AMD SME/SEV
> terminology but the underlying process of sharing/un-sharing the memory is
> different for Intel TDX guest platform.
> 
> SEV assumes that I/O devices can only do DMA to "decrypted"
> physical addresses without the C-bit set.  In order for the CPU
> to interact with this memory, the CPU needs a decrypted mapping.
> To add this support, AMD SME code forces force_dma_unencrypted()
> to return true for platforms that support AMD SEV feature. It will
> be used for DMA memory allocation API to trigger
> set_memory_decrypted() for platforms that support AMD SEV feature.
> 
> TDX is similar.  TDX architecturally prevents access to private

TDX doesn't prevent accesses.  If hardware _prevented_ accesses then we wouldn't
have to deal with the #MC mess.

> guest memory by anything other than the guest itself. This means
> that any DMA buffers must be shared.
> 
> So create a new file mem_encrypt_tdx.c to hold TDX specific memory
> initialization code, and re-define force_dma_unencrypted() for
> TDX guest and make it return true to get DMA pages mapped as shared.
> 
> __set_memory_enc_dec() is now aware about TDX and sets Shared bit
> accordingly following with relevant TDVMCALL.
> 
> Also, Do TDACCEPTPAGE on every 4k page after mapping the GPA range when

This should call out that the current TDX spec only supports 4kb AUG/ACCEPT.

On that topic... are there plans to support 2mb and/or 1gb TDH.MEM.PAGE.AUG?  If
so, will TDG.MEM.PAGE.ACCEPT also support 2mb/1gb granularity?

> converting memory to private.  If the VMM uses a common pool for private
> and shared memory, it will likely do TDAUGPAGE in response to MAP_GPA
> (or on the first access to the private GPA),

What the VMM does or does not do is irrelevant.  What matters is what the VMM is
_allowed_ to do without violating the GHCI.  Specifically, the VMM is allowed to
unmap a private page in response to MAP_GPA to convert to a shared page.

  If the GPA (range) was already mapped as an active, private page, the host
  VMM may remove the private page from the TD by following the “Removing TD
  Private Pages” sequence in the Intel TDX-module specification [3] to safely
  block the mapping(s), flush the TLB and cache, and remove the mapping(s).

That would also provide a nice segue into the "already accepted" error below.

> in which case TDX-Module will hold the page in a non-present "pending" state
> until it is explicitly accepted.
> 
> BUG() if TDACCEPTPAGE fails (except the above case)

What above case?  The code handles the case where the page was already accepted,
but the changelog doesn't talk about that at all.  

> as the guest is completely hosed if it can't access memory. 
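[Editor's sketch, not part of the original mail: the accept loop the changelog describes, with the "already accepted" case tolerated and anything else fatal. The status value and the tdcall stub are made up for illustration; the real code issues TDG.MEM.PAGE.ACCEPT and uses the error codes from the TDX-module spec.]

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE			4096u
#define TDX_SUCCESS			0ull
/* Illustrative placeholder; the real value comes from the TDX spec. */
#define TDX_PAGE_ALREADY_ACCEPTED	1ull

/* Stub standing in for the TDG.MEM.PAGE.ACCEPT tdcall. */
static uint64_t tdx_accept_page_stub(uint64_t gpa)
{
	/* Pretend the first page of the range was accepted earlier. */
	return gpa == 0x100000 ? TDX_PAGE_ALREADY_ACCEPTED : TDX_SUCCESS;
}

/*
 * After MapGPA converts a range to private, accept every 4K page.
 * "Already accepted" is tolerated (the VMM may have AUG'd the page in
 * response to MAP_GPA); any other failure means the guest cannot trust
 * its memory, so the real code would BUG().  Returns 0 on success.
 */
static int tdx_accept_range(uint64_t gpa, uint64_t npages)
{
	uint64_t i, ret;

	for (i = 0; i < npages; i++) {
		ret = tdx_accept_page_stub(gpa + i * PAGE_SIZE);
		if (ret && ret != TDX_PAGE_ALREADY_ACCEPTED)
			return -1;	/* real code: BUG() */
	}
	return 0;
}
```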


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 15:51                   ` Dave Hansen
  2021-05-18 16:23                     ` Sean Christopherson
@ 2021-05-18 20:12                     ` Kuppuswamy, Sathyanarayanan
  2021-05-18 20:19                       ` Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-18 20:12 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Isaku Yamahata



On 5/18/21 8:51 AM, Dave Hansen wrote:
> Question for KVM folks: Should all of these guest patches say:
> "x86/tdx/guest:" or something?  It seems like that would put us all in
> the right frame of mind as we review these.  It's kinda easy (for me at
> least) to get lost about which side I'm looking at sometimes.
> 
> On 5/17/21 5:15 PM, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> KVM hypercalls use the "vmcall" or "vmmcall" instructions.
>> Although the ABI is similar, those instructions no longer
>> function for TDX guests. Make vendor specififc TDVMCALLs
> 
> 				"vendor-specific"
> 
> 		    Hyphen and spelling ^

I will fix it in the next version.

> 
>> instead of VMCALL.
> 
> This would also be a great place to say:
> 
> This enables TDX guests to run with KVM acting as the hypervisor.  TDX
> guests running under other hypervisors will continue to use those
> hypervisors hypercalls.

I will include it.

> 
>> [Isaku: proposed KVM VENDOR string]
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> This SoB chain is odd.  Kirill wrote this, sent it to Isaku, who sent it
> to Sathya?

Initially we used "0" as the vendor ID for KVM, but Isaku proposed a new
value for it and sent a patch to fix it. I did not want to carry it as a
separate patch (for a one-line change), so I merged his change into this
patch and added his Signed-off-by with the comment ([Isaku: proposed KVM VENDOR string])

+#define TDVMCALL_VENDOR_KVM            0x4d564b2e584454 /* "TDX.KVM" */
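[Editor's sketch, not part of the original mail: the vendor ID above is just the ASCII string "TDX.KVM" packed little-endian into a 64-bit value, which can be checked mechanically.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack up to 8 ASCII bytes little-endian into a u64, low byte first. */
static uint64_t pack_vendor_string(const char *s)
{
	uint64_t v = 0;
	size_t i, n = strlen(s);

	for (i = 0; i < n && i < 8; i++)
		v |= (uint64_t)(unsigned char)s[i] << (8 * i);
	return v;
}
```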


> 
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 9e0e0ff76bab..768df1b98487 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -886,6 +886,12 @@ config INTEL_TDX_GUEST
>>   	  run in a CPU mode that protects the confidentiality of TD memory
>>   	  contents and the TD’s CPU state from other software, including VMM.
>>   
>> +config INTEL_TDX_GUEST_KVM
>> +	def_bool y
>> +	depends on KVM_GUEST && INTEL_TDX_GUEST
>> +	help
>> +	 This option enables KVM specific hypercalls in TDX guest.
> 
> For something that's not user-visible, I'd probably just add a Kconfig
> comment rather than help text.

If it is the preferred approach, I can remove it.

> 
> ...
>> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
>> index 7966c10ea8d1..a90fec004844 100644
>> --- a/arch/x86/kernel/Makefile
>> +++ b/arch/x86/kernel/Makefile
>> @@ -128,6 +128,7 @@ obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
>>   
>>   obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
>>   obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
>> +obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o
> 
> Is the indentation consistent with the other items near "tdx-kvm.o" in
> the Makefile?

Yes. For longer config names, common indentation is not maintained. Please
check the PMEM example.

126 obj-$(CONFIG_PARAVIRT_CLOCK)    += pvclock.o
127 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
128
129 obj-$(CONFIG_JAILHOUSE_GUEST)   += jailhouse.o
130 obj-$(CONFIG_INTEL_TDX_GUEST)   += tdcall.o tdx.o
131 obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o


> 
> ...
>> +/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
>> +long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
>> +		unsigned long p3, unsigned long p4)
>> +{
>> +	return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
>> +}
>> +EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);
> 
> I always forget that KVM code is goofy and needs to have things in C
> files so you can export the symbols.  Could you add a sentence to the
> changelog to this effect?
> 
> Code-wise, this is fine.  Just a few tweaks and I'll be happy to ack
> this one.

Will add it.

     Since KVM hypercall functions can be included and called
     from kernel modules, export the tdx_kvm_hypercall*() functions
     to avoid symbol errors.


> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 20:12                     ` Kuppuswamy, Sathyanarayanan
@ 2021-05-18 20:19                       ` Dave Hansen
  2021-05-18 20:57                         ` Kuppuswamy, Sathyanarayanan
  2021-05-18 21:19                         ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 20:19 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Isaku Yamahata

On 5/18/21 1:12 PM, Kuppuswamy, Sathyanarayanan wrote:
>>> [Isaku: proposed KVM VENDOR string]
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>>> Signed-off-by: Kuppuswamy Sathyanarayanan
>>> <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
>> This SoB chain is odd.  Kirill wrote this, sent it to Isaku, who sent it
>> to Sathya?
> 
> Initially we have used "0" as vendor ID for KVM. But Isaku proposed a new
> value for it and sent a patch to fix it. But, I did not want to carry it as
> separate patch (for one line change). So I have merged his change with
> this patch, and added his signed-off with comment ([Isaku: proposed KVM
> VENDOR string])
> 
> +#define TDVMCALL_VENDOR_KVM            0x4d564b2e584454 /* "TDX.KVM" */

That's a combined Co-developed-by+Signed-off-by situation.  You don't
add a bare SoB for that.

But, seriously, you don't need to preserve a SoB for a one-line patch.
Just pull the line in and make a note in the changelog.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 17:46               ` Dave Hansen
  2021-05-18 18:36                 ` Sean Christopherson
@ 2021-05-18 20:20                 ` Andi Kleen
  2021-05-18 20:40                   ` Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 20:20 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel


On 5/18/2021 10:46 AM, Dave Hansen wrote:
> On 5/18/21 10:21 AM, Andi Kleen wrote:
>> Besides instruction decoding works fine for all the existing
>> hypervisors. All we really want to do is to do the same thing as KVM
>> would do.
> Dumb question of the day: If you want to do the same thing that KVM
> does, why don't you share more code with KVM?  Wouldn't you, for
> instance, need to crack the same instruction opcodes?

We're talking about ~60 lines of code that call an established 
standard library.

https://github.com/intel/tdx/blob/8c20c364d1f52e432181d142054b1c2efa0ae6d3/arch/x86/kernel/tdx.c#L490

You're proposing a gigantic refactoring to avoid 60 lines of 
straightforward code.

That's not a practical proposal.

>
> I'd feel a lot better about this if you said:
>
> 	Listen, this doesn't work for everything.  But, it will run
> 	every single driver as a TDX guest that KVM can handle as a
> 	host.  So, if the TDX code is broken, so is the KVM host code.

I don't really know what problem you're trying to solve here. We only 
have a small number of drivers and we tested them and they work fine. 
There are special macros that limit the number of instructions. If there 
are ever more instructions and the macros break somehow we'll add them. 
There will be a clean error if it ever happens. We're not trying to 
solve hypothetical problems here.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 18:22               ` Sean Christopherson
@ 2021-05-18 20:28                 ` Andi Kleen
  2021-05-18 20:37                   ` Sean Christopherson
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 20:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel


On 5/18/2021 11:22 AM, Sean Christopherson wrote:
> On Tue, May 18, 2021, Andi Kleen wrote:
>>> The extra bytes for .altinstructions is very different than the extra bytes for
>>> the code itself.  The .altinstructions section is freed after init, so yes it
>>> bloats the kernel size a bit, but the runtime footprint is unaffected by the
>>> patching metadata.
>>>
>>> IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.
>>>
>>> The other option to explore is to hook/patch IO_COND(), which can be done with
> >>> negligible overhead because the helpers that use IO_COND() are not inlined.  In a
>>> TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
>>> majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
>>> And if there are TDX VMMs that want to deploy virtio-mmio, hooking
>>> drivers/virtio/virtio_mmio.c directly would be a viable option.
>> Yes but what's the point of all that?
> Patching IO_COND() is relatively low effort.  With some clever refactoring, I
> suspect the net lines of code added would be less than 10.  That seems like a
> worthwhile effort to avoid millions of faults over the lifetime of the guest.

AFAIK IO_COND is only for iomap users. But most drivers don't even use 
iomap. virtio doesn't for example, and that's really the only case we 
currently care about.

Also millions of faults is nothing for a CPU.

The only case I can see it making sense is the virtio (and vmbus) door 
bells. Everything else should be slow path anyways.

But doing that now would be premature optimization and that's usually a 
bad idea. If it's a problem we can fix it later.


>
>> Even if it's only 3 bytes we still have a lot of MMIO all over the kernel
>> which never needs it.
>>
>> And I don't even see what TDX (or SEV which already does the decoding and
>> has been merged) would get out of it. We handle all the #VEs just fine. And
> the instruction handling code is fairly straightforward too.
>>
>> Besides instruction decoding works fine for all the existing hypervisors.
>> All we really want to do is to do the same thing as KVM would do.
> Heh, trust me, you don't want to do the same thing KVM does :-)

We want the same behavior.

Yes probably not the same code.


-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 20:28                 ` Andi Kleen
@ 2021-05-18 20:37                   ` Sean Christopherson
  2021-05-18 20:56                     ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 20:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Tue, May 18, 2021, Andi Kleen wrote:
> 
> On 5/18/2021 11:22 AM, Sean Christopherson wrote:
> > On Tue, May 18, 2021, Andi Kleen wrote:
> > > > The extra bytes for .altinstructions is very different than the extra bytes for
> > > > the code itself.  The .altinstructions section is freed after init, so yes it
> > > > bloats the kernel size a bit, but the runtime footprint is unaffected by the
> > > > patching metadata.
> > > > 
> > > > IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.
> > > > 
> > > > The other option to explore is to hook/patch IO_COND(), which can be done with
> > > > negligible overhead because the helpers that use IO_COND() are not inlined.  In a
> > > > TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
> > > > majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
> > > > And if there are TDX VMMs that want to deploy virtio-mmio, hooking
> > > > drivers/virtio/virtio_mmio.c directly would be a viable option.
> > > Yes but what's the point of all that?
> > Patching IO_COND() is relatively low effort.  With some clever refactoring, I
> > suspect the net lines of code added would be less than 10.  That seems like a
> > worthwhile effort to avoid millions of faults over the lifetime of the guest.
> 
> AFAIK IO_COND is only for iomap users. But most drivers don't even use
> iomap. virtio doesn't for example, and that's really the only case we
> currently care about.

virtio-pci, which is going to be used by pretty much all traditional VMs, uses iomap.
See vp_get(), vp_set(), and all the vp_io{read,write}*() wrappers.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 20:20                 ` Andi Kleen
@ 2021-05-18 20:40                   ` Dave Hansen
  2021-05-18 21:05                     ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 20:40 UTC (permalink / raw)
  To: Andi Kleen, Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

On 5/18/21 1:20 PM, Andi Kleen wrote:
> 
> On 5/18/2021 10:46 AM, Dave Hansen wrote:
>> On 5/18/21 10:21 AM, Andi Kleen wrote:
>>> Besides instruction decoding works fine for all the existing
>>> hypervisors. All we really want to do is to do the same thing as KVM
>>> would do.
>> Dumb question of the day: If you want to do the same thing that KVM
>> does, why don't you share more code with KVM?  Wouldn't you, for
>> instance, need to crack the same instruction opcodes?
> 
> We're talking about ~60 lines of code that call an established
> standard library.
> 
> https://github.com/intel/tdx/blob/8c20c364d1f52e432181d142054b1c2efa0ae6d3/arch/x86/kernel/tdx.c#L490
> 
> You're proposing a gigantic refactoring to avoid 60 lines of
> straightforward code.
> 
> That's not a practical proposal.

Hi Andi,

I'm not actually trying to propose things.  I'm really just trying to
get an idea why the implementation ended up how it did.  I actually
entirely respect the position that the KVM code is a monster and
shouldn't get reused.  That seems totally reasonable.

What isn't reasonable is the lack of documentation of these design
decisions in the changelogs.  My goal here is to raise the quality of
the changelogs so that other reviewers and maintainers don't have to ask
these questions when they perform their reviews.

This is honestly the best way I know to help get this code merged as
soon as possible.  If I'm not helping, please let me know.  I'm happy to
spend my time elsewhere.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 20:37                   ` Sean Christopherson
@ 2021-05-18 20:56                     ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 20:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel



> virtio-pci, which is going to used by pretty much all traditional VMs, uses iomap.
> See vp_get(), vp_set(), and all the vp_io{read,write}*() wrappers.

That's true. But there are still all the other users. So it doesn't 
solve the problem. In the end I'm fairly sure we would need to patch 
readl/writel and friends.

-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 20:19                       ` Dave Hansen
@ 2021-05-18 20:57                         ` Kuppuswamy, Sathyanarayanan
  2021-05-18 21:19                         ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-18 20:57 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Isaku Yamahata



On 5/18/21 1:19 PM, Dave Hansen wrote:
> But, seriously, you don't need to preserve a SoB for a one-line patch.
> Just pull the line in and make a note in the changelog.

Ok. Makes sense. I will leave the comment and remove SOB from Isaku.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 20:40                   ` Dave Hansen
@ 2021-05-18 21:05                     ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 21:05 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel


> I'm not actually trying to propose things.  I'm really just trying to
> get an idea why the implementation ended up how it did.  I actually
> entirely respect the position that the KVM code is a monster and
> shouldn't get reused.  That seems totally reasonable.

Mainly because it's relatively simple and straightforward to do it this 
way. Yes, I know, that's a shocking concept, but sometimes it works even 
in Linux code.

>
> What isn't reasonable is the lack of documentation of these design
> decisions in the changelogs.  My goal here is to raise the quality of
> the changelogs so that other reviewers and maintainers don't have to ask
> these questions when they perform their reviews.
>
> This is honestly the best way I know to help get this code merged as
> soon as possible.  If I'm not helping, please let me know.  I'm happy to
> spend my time elsewhere.

I'm sure the commit logs can be improved and I appreciate your feedback.


I don't think every commit log needs to be an extended essay meandering 
all over the possible design space, talking about everything that could 
have been and wasn't. The way code is normally written is that we don't 
do an exhaustive search of possible options, but instead we pick a 
reasonable path and as long as that works and doesn't have too many 
problems we just stick to it. The commit log reflects that single path 
chosen, with only rare exceptions to talk about dead alleys.

In this case you can even see that multiple independent efforts (AMD and 
Intel) came mostly to fairly similar implementations, so the path chosen 
wasn't really that strange or non obvious.

Also overall I would appreciate if people would focus more on the code 
than the commit logs. Commit logs are important, but in the end what 
really matters is that the code is correct.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 20:19                       ` Dave Hansen
  2021-05-18 20:57                         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-18 21:19                         ` Kuppuswamy Sathyanarayanan
  2021-05-18 23:29                           ` Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18 21:19 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

KVM hypercalls use the "vmcall" or "vmmcall" instructions.
Although the ABI is similar, those instructions no longer
function for TDX guests. Make vendor-specific TDVMCALLs
instead of VMCALL. This enables TDX guests to run with KVM
acting as the hypervisor. TDX guests running under other
hypervisors will continue to use those hypervisor's
hypercalls.

Since KVM hypercall functions can be included and called
from kernel modules, export tdx_kvm_hypercall*() functions
to avoid symbol errors.

[Isaku Yamahata: proposed KVM VENDOR string]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix:
 * Removed "user help" for INTEL_TDX_GUEST_KVM config option
   and added a comment for it.
 * Added details about exporting symbols in the commit log.
 * Removed Isaku's sign-off.

Changes since RFC v2: 
 * Introduced INTEL_TDX_GUEST_KVM config for TDX+KVM related changes.
 * Removed "C" include file.
 * Fixed commit log as per Dave's comments.

 arch/x86/Kconfig                |  5 ++++
 arch/x86/include/asm/kvm_para.h | 21 +++++++++++++++
 arch/x86/include/asm/tdx.h      | 41 ++++++++++++++++++++++++++++
 arch/x86/kernel/Makefile        |  1 +
 arch/x86/kernel/tdcall.S        | 20 ++++++++++++++
 arch/x86/kernel/tdx-kvm.c       | 48 +++++++++++++++++++++++++++++++++
 6 files changed, 136 insertions(+)
 create mode 100644 arch/x86/kernel/tdx-kvm.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9e0e0ff76bab..15e66a99dd41 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,6 +886,11 @@ config INTEL_TDX_GUEST
 	  run in a CPU mode that protects the confidentiality of TD memory
 	  contents and the TD’s CPU state from other software, including VMM.
 
+# This option enables KVM specific hypercalls in TDX guest.
+config INTEL_TDX_GUEST_KVM
+	def_bool y
+	depends on KVM_GUEST && INTEL_TDX_GUEST
+
 endif #HYPERVISOR_GUEST
 
 source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
 #include <asm/alternative.h>
 #include <linux/interrupt.h>
 #include <uapi/asm/kvm_para.h>
+#include <asm/tdx.h>
 
 extern void kvmclock_init(void);
 
@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 static inline long kvm_hypercall0(unsigned int nr)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall0(nr);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall1(nr, p1);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
 				  unsigned long p2)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall2(nr, p1, p2);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
 				  unsigned long p2, unsigned long p3)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 				  unsigned long p4)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8ab4067afefc..eb758b506dba 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -73,4 +73,45 @@ static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
+			       u64 r15, struct tdx_hypercall_output *out);
+long tdx_kvm_hypercall0(unsigned int nr);
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1);
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2);
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3);
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4);
+#else
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+				      unsigned long p2)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3,
+				      unsigned long p4)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
+
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 7966c10ea8d1..a90fec004844 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -128,6 +128,7 @@ obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
 obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index a484c4aef6e6..3c57a1d67b79 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -25,6 +25,8 @@
 					  TDG_R12 | TDG_R13 | \
 					  TDG_R14 | TDG_R15 )
 
+#define TDVMCALL_VENDOR_KVM		0x4d564b2e584454 /* "TDX.KVM" */
+
 /*
  * TDX guests use the TDCALL instruction to make requests to the
  * TDX module and hypercalls to the VMM. It is supported in
@@ -213,3 +215,21 @@ SYM_FUNC_START(__tdx_hypercall)
 	call do_tdx_hypercall
 	retq
 SYM_FUNC_END(__tdx_hypercall)
+
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+/*
+ * Helper function for KVM vendor TDVMCALLs. This assembly wrapper
+ * lets us reuse do_tdx_hypercall() for KVM-specific hypercalls
+ * (TDVMCALL_VENDOR_KVM).
+ */
+SYM_FUNC_START(__tdx_hypercall_vendor_kvm)
+	/*
+	 * R10 is not part of the function call ABI, but it is part
+	 * of the TDVMCALL ABI, so set it before calling
+	 * do_tdx_hypercall().
+	 */
+	movq $TDVMCALL_VENDOR_KVM, %r10
+	call do_tdx_hypercall
+	retq
+SYM_FUNC_END(__tdx_hypercall_vendor_kvm)
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
diff --git a/arch/x86/kernel/tdx-kvm.c b/arch/x86/kernel/tdx-kvm.c
new file mode 100644
index 000000000000..b21453a81e38
--- /dev/null
+++ b/arch/x86/kernel/tdx-kvm.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2020 Intel Corporation */
+
+#include <asm/tdx.h>
+
+static long tdx_kvm_hypercall(unsigned int fn, unsigned long r12,
+			      unsigned long r13, unsigned long r14,
+			      unsigned long r15)
+{
+	return __tdx_hypercall_vendor_kvm(fn, r12, r13, r14, r15, NULL);
+}
+
+/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall0);
+
+/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return tdx_kvm_hypercall(nr, p1, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall1);
+
+/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2)
+{
+	return tdx_kvm_hypercall(nr, p1, p2, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall2);
+
+/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3)
+{
+	return tdx_kvm_hypercall(nr, p1, p2, p3, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall3);
+
+/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4)
+{
+	return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Make DMA pages shared
  2021-05-18 19:55     ` Sean Christopherson
@ 2021-05-18 22:12       ` Kuppuswamy, Sathyanarayanan
  2021-05-18 22:31         ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-18 22:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel, Kai Huang,
	Sean Christopherson



On 5/18/21 12:55 PM, Sean Christopherson wrote:
> On Mon, May 17, 2021, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> Intel TDX doesn't allow VMM to access guest memory. Any memory
>                                               ^
>                                               |- private
> 
> And to be pedantic, the VMM can _access_ guest private memory all it wants, it
> just can't decrypt guest private memory.

Ok. I will use "guest private memory".

> 
>> that is required for communication with VMM must be shared
>> explicitly by setting the bit in page table entry. And, after
>> setting the shared bit, the conversion must be completed with
>> MapGPA TDVMALL. The call informs VMM about the conversion and
>> makes it remove the GPA from the S-EPT mapping.
> 
> The VMM is _not_ required to remove the GPA from the S-EPT.  E.g. if the VMM
> wants to, it can leave a 2mb private page intact and create a 4kb shared page
> translation within the same range (ignoring the shared bit).

So would removing "makes it remove the GPA from the S-EPT mapping"
be sufficient? Or do you want to add more detail?


> 
>> The shared memory is similar to unencrypted memory in AMD SME/SEV
>> terminology but the underlying process of sharing/un-sharing the memory is
>> different for Intel TDX guest platform.
>>
>> SEV assumes that I/O devices can only do DMA to "decrypted"
>> physical addresses without the C-bit set.  In order for the CPU
>> to interact with this memory, the CPU needs a decrypted mapping.
>> To add this support, AMD SME code forces force_dma_unencrypted()
>> to return true for platforms that support AMD SEV feature. It will
>> be used for DMA memory allocation API to trigger
>> set_memory_decrypted() for platforms that support AMD SEV feature.
>>
>> TDX is similar.  TDX architecturally prevents access to private
> 
> TDX doesn't prevent accesses.  If hardware _prevented_ accesses then we wouldn't
> have to deal with the #MC mess.
How about the following change?

"TDX is similar. TDX architecturally prevents access to private guest memory by
  anything other than the guest itself. This means that any DMA buffers must be
  shared."

modified to =>

"TDX is similar. In TDX architecture, the private guest memory is encrypted, which
prevents anything other than guest from accessing/modifying it. So to communicate
with I/O devices, we need to create decrypted mapping and make the pages shared."

> 
>> guest memory by anything other than the guest itself. This means
>> that any DMA buffers must be shared.
>>
>> So create a new file mem_encrypt_tdx.c to hold TDX specific memory
>> initialization code, and re-define force_dma_unencrypted() for
>> TDX guest and make it return true to get DMA pages mapped as shared.
>>
>> __set_memory_enc_dec() is now aware about TDX and sets Shared bit
>> accordingly following with relevant TDVMCALL.
>>
>> Also, Do TDACCEPTPAGE on every 4k page after mapping the GPA range when
> 
> This should call out that the current TDX spec only supports 4kb AUG/ACCEPT.

Ok. I will add this spec detail.

> 
> On that topic... are there plans to support 2mb and/or 1gb TDH.MEM.PAGE.AUG?  If
> so, will TDG.MEM.PAGE.ACCEPT also support 2mb/1gb granularity?
> 
>> converting memory to private.  If the VMM uses a common pool for private
>> and shared memory, it will likely do TDAUGPAGE in response to MAP_GPA
>> (or on the first access to the private GPA),
> 
> What the VMM does or does not do is irrelevant.  What matters is what the VMM is
> _allowed_ to do without violating the GHCI.  Specifically, the VMM is allowed to
> unmap a private page in response to MAP_GPA to convert to a shared page.
> 
>    If the GPA (range) was already mapped as an active, private page, the host
>    VMM may remove the private page from the TD by following the “Removing TD
>    Private Pages” sequence in the Intel TDX-module specification [3] to safely
>    block the mapping(s), flush the TLB and cache, and remove the mapping(s).
> 
> That would also provide a nice segue into the "already accepted" error below.

Ok. I will add the above detail.

> 
>> in which case TDX-Module will hold the page in a non-present "pending" state
>> until it is explicitly accepted.
>>
>> BUG() if TDACCEPTPAGE fails (except the above case)
> 
> What above case?  The code handles the case where the page was already accepted,
> but the changelog doesn't talk about that at all.

I think it meant the "already accepted" page case. With your above suggestion,
we can ignore this error. Or I can change it to:

BUG() if TDACCEPTPAGE fails (except "previously accepted page" case)

> 
>> as the guest is completely hosed if it can't access memory.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Make DMA pages shared
  2021-05-18 22:12       ` Kuppuswamy, Sathyanarayanan
@ 2021-05-18 22:31         ` Dave Hansen
  2021-06-01  2:06           ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 22:31 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Sean Christopherson
  Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, linux-kernel, Kai Huang, Sean Christopherson

On 5/18/21 3:12 PM, Kuppuswamy, Sathyanarayanan wrote:
> "TDX is similar. In TDX architecture, the private guest memory is 
> encrypted, which prevents anything other than guest from
> accessing/modifying it. So to communicate with I/O devices, we need
> to create decrypted mapping and make the pages shared."

That's actually even more wrong. :(

Check out "Machine Check Architecture Background" in the TDX
architecture spec.

Modification is totally permitted in the architecture.  A host can write
all day long to guest memory.  Depending on how you use the word,
"access" can also include writes.

TDX really just prevents guests from *consuming* the gunk that an
attacker might write.

Also, don't say "decrypted".  The memory is probably still TME-enabled
and probably encrypted on the DIMM.  It's still encrypted even if
shared, it's just using the TME key, not the TD key.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 21:19                         ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
@ 2021-05-18 23:29                           ` Dave Hansen
  2021-05-19  1:17                             ` [RFC v2-fix-v3 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 23:29 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

On 5/18/21 2:19 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> KVM hypercalls use the "vmcall" or "vmmcall" instructions.
> Although the ABI is similar, those instructions no longer
> function for TDX guests. Make vendor-specific TDVMCALLs
> instead of VMCALL. This enables TDX guests to run with KVM
> acting as the hypervisor. TDX guests running under other
> hypervisors will continue to use those hypervisor's
> hypercalls.

Well, I screwed this up when I typed it too, but it is:

	TDX guests running under other hypervisors will continue
	to use those hypervisors' hypercalls.

I hate how that reads, but oh well.

> Since KVM hypercall functions can be included and called
> from kernel modules, export tdx_kvm_hypercall*() functions
> to avoid symbol errors

No, you're not avoiding errors, you're exporting the symbol so it can be
*USED*.  The error comes from it not being exported.

It also helps to be specific here:  Export tdx_kvm_hypercall*() to make
the symbols visible to kvm.ko.

> [Isaku Yamahata: proposed KVM VENDOR string]
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Reviewed-by: Dave Hansen <dave.hansen@intel.com>

Also, FWIW, if you did this in the header:

+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
+}

You could get away with just exporting tdx_kvm_hypercall() instead of 4
symbols.  The rest of the code would look the same.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v3 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 23:29                           ` Dave Hansen
@ 2021-05-19  1:17                             ` Kuppuswamy Sathyanarayanan
  2021-05-19  1:20                               ` Sathyanarayanan Kuppuswamy Natarajan
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-19  1:17 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

KVM hypercalls use the "vmcall" or "vmmcall" instructions.
Although the ABI is similar, those instructions no longer
function for TDX guests. Make vendor-specific TDVMCALLs
instead of VMCALL. This enables TDX guests to run with KVM
acting as the hypervisor. TDX guests running under other
hypervisors will continue to use those hypervisors'
hypercalls.

Since the KVM driver can be built as a kernel module, export
tdx_kvm_hypercall*() to make the symbols visible to kvm.ko.

[Isaku Yamahata: proposed KVM VENDOR string]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/Kconfig                |  5 +++
 arch/x86/include/asm/kvm_para.h | 21 ++++++++++
 arch/x86/include/asm/tdx.h      | 68 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdcall.S        | 26 +++++++++++++
 4 files changed, 120 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9e0e0ff76bab..15e66a99dd41 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,6 +886,11 @@ config INTEL_TDX_GUEST
 	  run in a CPU mode that protects the confidentiality of TD memory
 	  contents and the TD’s CPU state from other software, including VMM.
 
+# This option enables KVM specific hypercalls in TDX guest.
+config INTEL_TDX_GUEST_KVM
+	def_bool y
+	depends on KVM_GUEST && INTEL_TDX_GUEST
+
 endif #HYPERVISOR_GUEST
 
 source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
 #include <asm/alternative.h>
 #include <linux/interrupt.h>
 #include <uapi/asm/kvm_para.h>
+#include <asm/tdx.h>
 
 extern void kvmclock_init(void);
 
@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 static inline long kvm_hypercall0(unsigned int nr)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall0(nr);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall1(nr, p1);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
 				  unsigned long p2)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall2(nr, p1, p2);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
 				  unsigned long p2, unsigned long p3)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 				  unsigned long p4)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8ab4067afefc..3d8d977e52f0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -73,4 +73,72 @@ static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
+			       u64 r15, struct tdx_hypercall_output *out);
+
+/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return __tdx_hypercall_vendor_kvm(nr, 0, 0, 0, 0, NULL);
+}
+
+/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return __tdx_hypercall_vendor_kvm(nr, p1, 0, 0, 0, NULL);
+}
+
+/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+				      unsigned long p2)
+{
+	return __tdx_hypercall_vendor_kvm(nr, p1, p2, 0, 0, NULL);
+}
+
+/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3)
+{
+	return __tdx_hypercall_vendor_kvm(nr, p1, p2, p3, 0, NULL);
+}
+
+/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3,
+				      unsigned long p4)
+{
+	return __tdx_hypercall_vendor_kvm(nr, p1, p2, p3, p4, NULL);
+}
+#else
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+				      unsigned long p2)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3,
+				      unsigned long p4)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
+
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index 2dfecdae38bb..27355fb80aeb 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -3,6 +3,7 @@
 #include <asm/asm.h>
 #include <asm/frame.h>
 #include <asm/unwind_hints.h>
+#include <asm/export.h>
 
 #include <linux/linkage.h>
 #include <linux/bits.h>
@@ -25,6 +26,8 @@
 					  TDG_R12 | TDG_R13 | \
 					  TDG_R14 | TDG_R15 )
 
+#define TDVMCALL_VENDOR_KVM		0x4d564b2e584454 /* "TDX.KVM" */
+
 /*
  * TDX guests use the TDCALL instruction to make requests to the
  * TDX module and hypercalls to the VMM. It is supported in
@@ -212,3 +215,26 @@ SYM_FUNC_START(__tdx_hypercall)
 	FRAME_END
 	retq
 SYM_FUNC_END(__tdx_hypercall)
+
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+
+/*
+ * Helper function for KVM vendor TDVMCALLs. This assembly wrapper
+ * lets us reuse do_tdx_hypercall() for KVM-specific hypercalls
+ * (TDVMCALL_VENDOR_KVM).
+ */
+SYM_FUNC_START(__tdx_hypercall_vendor_kvm)
+	FRAME_BEGIN
+	/*
+	 * R10 is not part of the function call ABI, but it is a part
+	 * of the TDVMCALL ABI. So set it before calling
+	 * do_tdx_hypercall().
+	 */
+	movq $TDVMCALL_VENDOR_KVM, %r10
+	call do_tdx_hypercall
+	FRAME_END
+	retq
+SYM_FUNC_END(__tdx_hypercall_vendor_kvm)
+
+EXPORT_SYMBOL(__tdx_hypercall_vendor_kvm);
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
-- 
2.25.1



* Re: [RFC v2-fix-v3 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-19  1:17                             ` [RFC v2-fix-v3 " Kuppuswamy Sathyanarayanan
@ 2021-05-19  1:20                               ` Sathyanarayanan Kuppuswamy Natarajan
  0 siblings, 0 replies; 381+ messages in thread
From: Sathyanarayanan Kuppuswamy Natarajan @ 2021-05-19  1:20 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Dan Williams, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

Sorry, I missed including the change log:

* Removed tdx-kvm.c and implemented tdx_kvm_hypercall*() functions in tdx.h
* Exported __tdx_hypercall_vendor_kvm() symbol for kvm.ko.
* Fixed commit log as per Dave's suggestion.
* Added Reviewed-by from Dave
* Added FRAME_BEGIN/FRAME_END for __tdx_hypercall_vendor_kvm() to fix
compiler warnings.

On Tue, May 18, 2021 at 6:17 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> KVM hypercalls use the "vmcall" or "vmmcall" instructions.
> Although the ABI is similar, those instructions no longer
> function for TDX guests. Use vendor-specific TDVMCALLs
> instead of VMCALL. This enables TDX guests to run with KVM
> acting as the hypervisor. TDX guests running under other
> hypervisors will continue to use those hypervisors'
> hypercalls.
>
> Since the KVM driver can be built as a kernel module, export
> tdx_kvm_hypercall*() to make the symbols visible to kvm.ko.
>
> [Isaku Yamahata: proposed KVM VENDOR string]
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Reviewed-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/Kconfig                |  5 +++
>  arch/x86/include/asm/kvm_para.h | 21 ++++++++++
>  arch/x86/include/asm/tdx.h      | 68 +++++++++++++++++++++++++++++++++
>  arch/x86/kernel/tdcall.S        | 26 +++++++++++++
>  4 files changed, 120 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9e0e0ff76bab..15e66a99dd41 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -886,6 +886,11 @@ config INTEL_TDX_GUEST
>           run in a CPU mode that protects the confidentiality of TD memory
>           contents and the TD’s CPU state from other software, including VMM.
>
> +# This option enables KVM specific hypercalls in TDX guest.
> +config INTEL_TDX_GUEST_KVM
> +       def_bool y
> +       depends on KVM_GUEST && INTEL_TDX_GUEST
> +
>  endif #HYPERVISOR_GUEST
>
>  source "arch/x86/Kconfig.cpu"
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 338119852512..2fa85481520b 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -6,6 +6,7 @@
>  #include <asm/alternative.h>
>  #include <linux/interrupt.h>
>  #include <uapi/asm/kvm_para.h>
> +#include <asm/tdx.h>
>
>  extern void kvmclock_init(void);
>
> @@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
>  static inline long kvm_hypercall0(unsigned int nr)
>  {
>         long ret;
> +
> +       if (is_tdx_guest())
> +               return tdx_kvm_hypercall0(nr);
> +
>         asm volatile(KVM_HYPERCALL
>                      : "=a"(ret)
>                      : "a"(nr)
> @@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
>  static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
>  {
>         long ret;
> +
> +       if (is_tdx_guest())
> +               return tdx_kvm_hypercall1(nr, p1);
> +
>         asm volatile(KVM_HYPERCALL
>                      : "=a"(ret)
>                      : "a"(nr), "b"(p1)
> @@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
>                                   unsigned long p2)
>  {
>         long ret;
> +
> +       if (is_tdx_guest())
> +               return tdx_kvm_hypercall2(nr, p1, p2);
> +
>         asm volatile(KVM_HYPERCALL
>                      : "=a"(ret)
>                      : "a"(nr), "b"(p1), "c"(p2)
> @@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
>                                   unsigned long p2, unsigned long p3)
>  {
>         long ret;
> +
> +       if (is_tdx_guest())
> +               return tdx_kvm_hypercall3(nr, p1, p2, p3);
> +
>         asm volatile(KVM_HYPERCALL
>                      : "=a"(ret)
>                      : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
> @@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
>                                   unsigned long p4)
>  {
>         long ret;
> +
> +       if (is_tdx_guest())
> +               return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
> +
>         asm volatile(KVM_HYPERCALL
>                      : "=a"(ret)
>                      : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 8ab4067afefc..3d8d977e52f0 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -73,4 +73,72 @@ static inline void tdx_early_init(void) { };
>
>  #endif /* CONFIG_INTEL_TDX_GUEST */
>
> +#ifdef CONFIG_INTEL_TDX_GUEST_KVM
> +u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
> +                              u64 r15, struct tdx_hypercall_output *out);
> +
> +/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall0(unsigned int nr)
> +{
> +       return __tdx_hypercall_vendor_kvm(nr, 0, 0, 0, 0, NULL);
> +}
> +
> +/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
> +{
> +       return __tdx_hypercall_vendor_kvm(nr, p1, 0, 0, 0, NULL);
> +}
> +
> +/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
> +                                     unsigned long p2)
> +{
> +       return __tdx_hypercall_vendor_kvm(nr, p1, p2, 0, 0, NULL);
> +}
> +
> +/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
> +                                     unsigned long p2, unsigned long p3)
> +{
> +       return __tdx_hypercall_vendor_kvm(nr, p1, p2, p3, 0, NULL);
> +}
> +
> +/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
> +                                     unsigned long p2, unsigned long p3,
> +                                     unsigned long p4)
> +{
> +       return __tdx_hypercall_vendor_kvm(nr, p1, p2, p3, p4, NULL);
> +}
> +#else
> +static inline long tdx_kvm_hypercall0(unsigned int nr)
> +{
> +       return -ENODEV;
> +}
> +
> +static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
> +{
> +       return -ENODEV;
> +}
> +
> +static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
> +                                     unsigned long p2)
> +{
> +       return -ENODEV;
> +}
> +
> +static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
> +                                     unsigned long p2, unsigned long p3)
> +{
> +       return -ENODEV;
> +}
> +
> +static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
> +                                     unsigned long p2, unsigned long p3,
> +                                     unsigned long p4)
> +{
> +       return -ENODEV;
> +}
> +#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
> +
>  #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> index 2dfecdae38bb..27355fb80aeb 100644
> --- a/arch/x86/kernel/tdcall.S
> +++ b/arch/x86/kernel/tdcall.S
> @@ -3,6 +3,7 @@
>  #include <asm/asm.h>
>  #include <asm/frame.h>
>  #include <asm/unwind_hints.h>
> +#include <asm/export.h>
>
>  #include <linux/linkage.h>
>  #include <linux/bits.h>
> @@ -25,6 +26,8 @@
>                                           TDG_R12 | TDG_R13 | \
>                                           TDG_R14 | TDG_R15 )
>
> +#define TDVMCALL_VENDOR_KVM            0x4d564b2e584454 /* "TDX.KVM" */
> +
>  /*
>   * TDX guests use the TDCALL instruction to make requests to the
>   * TDX module and hypercalls to the VMM. It is supported in
> @@ -212,3 +215,26 @@ SYM_FUNC_START(__tdx_hypercall)
>         FRAME_END
>         retq
>  SYM_FUNC_END(__tdx_hypercall)
> +
> +#ifdef CONFIG_INTEL_TDX_GUEST_KVM
> +
> +/*
> + * Helper function for KVM vendor TDVMCALLs. This assembly wrapper
> + * lets us reuse do_tdx_hypercall() for KVM-specific hypercalls
> + * (TDVMCALL_VENDOR_KVM).
> + */
> +SYM_FUNC_START(__tdx_hypercall_vendor_kvm)
> +       FRAME_BEGIN
> +       /*
> +        * R10 is not part of the function call ABI, but it is a part
> +        * of the TDVMCALL ABI. So set it before making call to the
> +        * do_tdx_hypercall().
> +        */
> +       movq $TDVMCALL_VENDOR_KVM, %r10
> +       call do_tdx_hypercall
> +       FRAME_END
> +       retq
> +SYM_FUNC_END(__tdx_hypercall_vendor_kvm)
> +
> +EXPORT_SYMBOL(__tdx_hypercall_vendor_kvm);
> +#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
> --
> 2.25.1
>


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-04-26 18:01 ` [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Kuppuswamy Sathyanarayanan
@ 2021-05-19  5:00   ` Kuppuswamy, Sathyanarayanan
  2021-05-19 16:14   ` Dave Hansen
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-19  5:00 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

Hi Dave,

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov"<kirill.shutemov@linux.intel.com>
> 
> tdx_shared_mask() returns the mask that has to be set in a page
> table entry to make page shared with VMM.
> 
> Also, note that we cannot club shared mapping configuration between
> AMD SME and Intel TDX Guest platforms in common function. SME has
> to do it very early in __startup_64() as it sets the bit on all
> memory, except what is used for communication. TDX can postpone as
> we don't need any shared mapping in very early boot.
> 
> Signed-off-by: Kirill A. Shutemov<kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen<ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan<sathyanarayanan.kuppuswamy@linux.intel.com>

Any comments on this patch?

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-04-27 19:20               ` Dave Hansen
  2021-04-28 17:42                 ` [PATCH v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() " Kuppuswamy Sathyanarayanan
@ 2021-05-19  5:58                 ` Kuppuswamy Sathyanarayanan
  2021-05-19  6:04                   ` Kuppuswamy, Sathyanarayanan
  2021-05-19 15:31                   ` Dave Hansen
  1 sibling, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-19  5:58 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kuppuswamy Sathyanarayanan

Guests communicate with VMMs via hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs,
like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
expose guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called "tdcall".

In a TDX-based VM, since the VMM is an untrusted entity, an
intermediary layer (the TDX module) exists between the host and the
guest to facilitate secure communication. The "tdcall" instruction is
used by the guest to request services from the TDX module, and a
variant of the "tdcall" instruction (with specific arguments as
defined by the GHCI) is used by the guest to request services from
the VMM via the TDX module.

Implement common helper functions to communicate with the TDX module
and the VMM (using the TDCALL instruction):

__tdx_hypercall()    - used to request services from the VMM.
__tdx_module_call()  - used to communicate with the TDX module.

Also define two additional wrappers, tdx_hypercall() and
tdx_hypercall_out_r11(), to cover the common use cases of the
__tdx_hypercall() function. Since each use case of
__tdx_module_call() is different, no such wrappers are needed for it.

Implement __tdx_module_call() and __tdx_hypercall() helper functions
in assembly.

The rationale for choosing plain assembly over inline assembly is:

1. The __tdx_hypercall() implementation is over 70 lines of
instructions (with comments); implementing it in inline assembly
would make it hard to read.

2. Since many registers (R8-R15, R[A-D]X) are used in the TDCALL
operation, including all of them as inline assembly constraints
may not be supported by some older compilers.

Also, just like syscalls, not all TDVMCALL/TDCALL use cases need to
use the same set of argument registers. The implementation here picks
the current worst-case scenario for TDCALL (4 registers). For TDCALLs
with fewer than 4 arguments, there will end up being a few superfluous
(cheap) instructions. But this approach maximizes code reuse. The
same argument applies to the __tdx_hypercall() function as well.

The current implementation of __tdx_hypercall() includes error
handling (ud2 on the failure case) in the assembly function instead
of in a C wrapper. The reason for this choice is that, when adding
support for in/out instructions (refer to the patch titled "x86/tdx:
Handle port I/O" in this series), alternative_io() is used to
substitute in/out instructions with __tdx_hypercall() calls. Using C
wrappers is not trivial in that case because the input parameters
would be in the wrong registers, and it's tricky to include the
proper buffer code to make this work.

For the registers used by the TDCALL instruction, please check the
TDX GHCI specification, sections 2.4 and 3.

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Originally-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2: 
 * Renamed __tdcall()/__tdvmcall() to __tdx_module_call()/__tdx_hypercall().
 * Renamed reg offsets from TDCALL_rx to TDX_MODULE_rx.
 * Renamed reg offsets from TDVMCALL_rx to TDX_HYPERCALL_rx.
 * Renamed struct tdcall_output to struct tdx_module_output.
 * Renamed struct tdvmcall_output to struct tdx_hypercall_output.
 * Used BIT() to derive TDVMCALL_EXPOSE_REGS_MASK.
 * Removed unnecessary push/pop sequence in __tdcall() function.
 * Fixed comments as per Dave's review.

 arch/x86/include/asm/tdx.h    |  38 ++++++
 arch/x86/kernel/Makefile      |   2 +-
 arch/x86/kernel/asm-offsets.c |  22 ++++
 arch/x86/kernel/tdcall.S      | 222 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c         |  39 ++++++
 5 files changed, 322 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..211b9d66b1b1 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,50 @@
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
+#include <linux/types.h>
+
+/*
+ * Used in __tdx_module_call() helper function to gather the
+ * output registers values of TDCALL instruction when requesting
+ * services from the TDX module. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+/*
+ * Used in __tdx_hypercall() helper function to gather the
+ * output registers values of TDCALL instruction when requesting
+ * services from the VMM. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_hypercall_output {
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
 void __init tdx_early_init(void);
 
+/* Helper function used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		      struct tdx_module_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+		    struct tdx_hypercall_output *out);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..e6b3bb983992 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
 #include <xen/interface/xen.h>
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
 #ifdef CONFIG_X86_32
 # include "asm-offsets_32.c"
 #else
@@ -75,6 +79,24 @@ static void __used common(void)
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+	BLANK();
+	/* Offset for fields in tdcall_output */
+	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
+	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
+	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+	/* Offset for fields in tdvmcall_output */
+	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
+	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
+	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
+	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
+	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#endif
+
 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
 	OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..a67c595e4169
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,222 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+
+#define TDG_R10		BIT(10)
+#define TDG_R11		BIT(11)
+#define TDG_R12		BIT(12)
+#define TDG_R13		BIT(13)
+#define TDG_R14		BIT(14)
+#define TDG_R15		BIT(15)
+
+/*
+ * Expose registers R10-R15 to VMM. It is passed via RCX register
+ * to the TDX Module, which will be used by the TDX module to
+ * identify the list of registers exposed to VMM. Each bit in this
+ * mask represents a register ID. You can find the bit field details
+ * in TDX GHCI specification.
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK	( TDG_R10 | TDG_R11 | \
+					  TDG_R12 | TDG_R13 | \
+					  TDG_R14 | TDG_R15 )
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdx_module_call()  - Helper function used by TDX guests to request
+ * services from the TDX module (does not include VMM services).
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with the
+ * TDX module.  And if the "tdcall" operation is successful and a valid
+ * "struct tdx_module_output" pointer is available (in "out" argument),
+ * output from the TDX module is saved to the memory specified in the
+ * "out" pointer. Also the status of the "tdcall" operation is returned
+ * back to the user as a function return value.
+ *
+ * @fn  (RDI)		- TDCALL Leaf ID,    moved to RAX
+ * @rcx (RSI)		- Input parameter 1, moved to RCX
+ * @rdx (RDX)		- Input parameter 2, moved to RDX
+ * @r8  (RCX)		- Input parameter 3, moved to R8
+ * @r9  (R8)		- Input parameter 4, moved to R9
+ *
+ * @out (R9)		- struct tdx_module_output pointer
+ *			  stored temporarily in R12 (not
+ * 			  shared with the TDX module)
+ *
+ * Return status of tdcall via RAX.
+ *
+ * NOTE: This function should not be used for TDX hypercall
+ *       use cases.
+ */
+SYM_FUNC_START(__tdx_module_call)
+	FRAME_BEGIN
+
+	/*
+	 * R12 will be used as temporary storage for
+	 * struct tdx_module_output pointer. You can
+	 * find struct tdx_module_output details in
+	 * arch/x86/include/asm/tdx.h. Also note that
+	 * registers R12-R15 are not used by TDCALL
+	 * services supported by this helper function.
+	 */
+	push %r12	/* Callee saved, so preserve it */
+	mov %r9,  %r12 	/* Move output pointer to R12 */
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	mov %rdi, %rax	/* Move TDCALL Leaf ID to RAX */
+	mov %r8,  %r9	/* Move input 4 to R9 */
+	mov %rcx, %r8	/* Move input 3 to R8 */
+	mov %rsi, %rcx	/* Move input 1 to RCX */
+	/* Leave input param 2 in RDX */
+
+	tdcall
+
+	/* Check for TDCALL success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check for TDCALL output struct != NULL */
+	test %r12, %r12
+	jz 1f
+
+	/* Copy TDCALL result registers to output struct: */
+	movq %rcx, TDX_MODULE_rcx(%r12)
+	movq %rdx, TDX_MODULE_rdx(%r12)
+	movq %r8,  TDX_MODULE_r8(%r12)
+	movq %r9,  TDX_MODULE_r9(%r12)
+	movq %r10, TDX_MODULE_r10(%r12)
+	movq %r11, TDX_MODULE_r11(%r12)
+1:
+	pop %r12 /* Restore the state of R12 register */
+
+	FRAME_END
+	ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * do_tdx_hypercall()  - Helper function used by TDX guests to request
+ * services from the VMM. All requests are made via the TDX module
+ * using "TDCALL" instruction.
+ *
+ * This function contains the code that is common to vendor-specific
+ * and standard TDX hypercalls. The caller of this function has to set
+ * the TDVMCALL type in the R10 register before calling it.
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with VMM
+ * via the TDX module. And if the "tdcall" operation is successful and a
+ * valid "struct tdx_hypercall_output" pointer is available (in "out"
+ * argument), output from the VMM is saved to the memory specified in the
+ * "out" pointer. 
+ *
+ * @fn  (RDI)		- TDVMCALL function, moved to R11
+ * @r12 (RSI)		- Input parameter 1, moved to R12
+ * @r13 (RDX)		- Input parameter 2, moved to R13
+ * @r14 (RCX)		- Input parameter 3, moved to R14
+ * @r15 (R8)		- Input parameter 4, moved to R15
+ *
+ * @out (R9)		- struct tdx_hypercall_output pointer
+ *
+ * On successful completion, return TDX hypercall error code.
+ * If the "tdcall" operation fails, panic.
+ *
+ */
+SYM_FUNC_START_LOCAL(do_tdx_hypercall)
+	/* Save non-volatile GPRs that are exposed to the VMM. */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/* Leave hypercall output pointer in R9, it's not clobbered by VMM */
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	xor %eax, %eax /* Move TDCALL leaf ID (TDVMCALL (0)) to RAX */
+	mov %rdi, %r11 /* Move TDVMCALL function id to R11 */
+	mov %rsi, %r12 /* Move input 1 to R12 */
+	mov %rdx, %r13 /* Move input 2 to R13 */
+	mov %rcx, %r14 /* Move input 3 to R14 */
+	mov %r8,  %r15 /* Move input 4 to R15 */
+	/* Caller of do_tdx_hypercall() will set TDVMCALL type in R10 */
+
+	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+	tdcall
+
+	/*
+	 * Check for TDCALL success: 0 - Successful, otherwise failed.
+	 * If failed, there is an issue with TDX Module which is fatal
+	 * for the guest. So panic. Also note that RAX is controlled
+	 * only by the TDX module and not exposed to VMM.
+	 */
+	test %rax, %rax
+	jnz 2f
+
+	/* Move hypercall error code to RAX to return to user */
+	mov %r10, %rax
+
+	/* Check for hypercall success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check for hypercall output struct != NULL */
+	test %r9, %r9
+	jz 1f
+
+	/* Copy hypercall result registers to output struct: */
+	movq %r11, TDX_HYPERCALL_r11(%r9)
+	movq %r12, TDX_HYPERCALL_r12(%r9)
+	movq %r13, TDX_HYPERCALL_r13(%r9)
+	movq %r14, TDX_HYPERCALL_r14(%r9)
+	movq %r15, TDX_HYPERCALL_r15(%r9)
+1:
+	/*
+	 * Zero out registers exposed to the VMM to avoid
+	 * speculative execution with VMM-controlled values.
+	 */
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+	xor %r12d, %r12d
+	xor %r13d, %r13d
+	xor %r14d, %r14d
+	xor %r15d, %r15d
+
+	/* Restore non-volatile GPRs that are exposed to the VMM. */
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	ret
+2:
+	ud2
+SYM_FUNC_END(do_tdx_hypercall)
+
+/*
+ * Helper function for standard TDVMCALLs. This assembly wrapper
+ * reuses do_tdx_hypercall() for standard hypercalls
+ * (R10 is set to zero).
+ */
+SYM_FUNC_START(__tdx_hypercall)
+	FRAME_BEGIN
+	/*
+	 * R10 is not part of the function call ABI, but it is a part
+	 * of the TDVMCALL ABI. So set it to 0 for a standard TDVMCALL
+	 * before calling do_tdx_hypercall().
+	 */
+	xor %r10, %r10
+	call do_tdx_hypercall
+	FRAME_END
+	retq
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 6a7193fead08..cbfefc42641e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -1,8 +1,47 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (C) 2020 Intel Corporation */
 
+#define pr_fmt(fmt) "TDX: " fmt
+
 #include <asm/tdx.h>
 
+/*
+ * Wrapper for simple hypercalls that only return a success/error code.
+ */
+static inline u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	u64 err;
+
+	err = __tdx_hypercall(fn, r12, r13, r14, r15, NULL);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return err;
+}
+
+/*
+ * Wrapper for the semi-common case where a single output
+ * value (R11) is needed. Callers of this function do not care
+ * about the hypercall error code (mainly the IN or MMIO use cases).
+ */
+static inline u64 tdx_hypercall_out_r11(u64 fn, u64 r12, u64 r13,
+					u64 r14, u64 r15)
+{
+
+	struct tdx_hypercall_output out = {0};
+	u64 err;
+
+	err = __tdx_hypercall(fn, r12, r13, r14, r15, &out);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return out.r11;
+}
+
 static inline bool cpuid_has_tdx_guest(void)
 {
 	u32 eax, signature[3];
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-05-19  5:58                 ` [RFC v2-fix-v1 " Kuppuswamy Sathyanarayanan
@ 2021-05-19  6:04                   ` Kuppuswamy, Sathyanarayanan
  2021-05-19 15:31                   ` Dave Hansen
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-19  6:04 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

Hi Dave,

On 5/18/21 10:58 PM, Kuppuswamy Sathyanarayanan wrote:
> Guests communicate with VMMs with hypercalls. Historically, these
> are implemented using instructions that are known to cause VMEXITs
> like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
> expose guest state to the host.  This prevents the old hypercall
> mechanisms from working. So to communicate with VMM, TDX
> specification defines a new instruction called "tdcall".
> 
> In TDX based VM, since VMM is an untrusted entity, a intermediary
> layer (TDX module) exists between host and guest to facilitate the
> secure communication. And "tdcall" instruction  is used by the guest
> to request services from TDX module. And a variant of "tdcall"
> instruction (with specific arguments as defined by GHCI) is used by
> the guest to request services from  VMM via the TDX module.
> 
> Implement common helper functions to communicate with the TDX Module
> and VMM (using TDCALL instruction).
>     
> __tdx_hypercall()    - function can be used to request services from
> 		       the VMM.
> __tdx_module_call()  - function can be used to communicate with the
> 		       TDX Module.
> 
> Also define two additional wrappers, tdx_hypercall() and
> tdx_hypercall_out_r11() to cover common use cases of
> __tdx_hypercall() function. Since each use case of
> __tdx_module_call() is different, we don't need such wrappers for it.
> 
> Implement __tdx_module_call() and __tdx_hypercall() helper functions
> in assembly.
> 
> Rationale behind choosing to use assembly over inline assembly are,
> 
> 1. Since the number of lines of instructions (with comments) in
> __tdx_hypercall() implementation is over 70, using inline assembly
> to implement it will make it hard to read.
>     
> 2. Also, since many registers (R8-R15, R[A-D]X)) will be used in
> TDCALL operation, if all these registers are included in in-line
> assembly constraints, some of the older compilers may not
> be able to meet this requirement.
> 
> Also, just like syscalls, not all TDVMCALL/TDCALLs use cases need to
> use the same set of argument registers. The implementation here picks
> the current worst-case scenario for TDCALL (4 registers). For TDCALLs
> with fewer than 4 arguments, there will end up being a few superfluous
> (cheap) instructions.  But, this approach maximizes code reuse. The
> same argument applies to __tdx_hypercall() function as well.
> 
> Current implementation of __tdx_hypercall() includes error handling
> (ud2 on failure case) in assembly function instead of doing it in C
> wrapper function. The reason behind this choice is, when adding support
> for in/out instructions (refer to patch titled "x86/tdx: Handle port
> I/O" in this series), we use alternative_io() to substitute in/out
> instruction with  __tdx_hypercall() calls. So use of C wrappers is not
> trivial in this case because the input parameters will be in the wrong
> registers and it's tricky to include proper buffer code to make this
> happen.
> 
> For registers used by TDCALL instruction, please check TDX GHCI
> specification, sec 2.4 and 3.
> 
> https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
> 
> Originally-by: Sean Christopherson<seanjc@google.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan<sathyanarayanan.kuppuswamy@linux.intel.com>

I did send it as an in-reply-to to message id 3a7c0bba-cc43-e4ba-f7fe-43c8627c2fc2@intel.com (your
last reply's mail id), but for some reason it's not detected as a reply to the original patch
"[RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions".

I am not sure what's going on, but please review it as a reply to the original patch.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-05-19  5:58                 ` [RFC v2-fix-v1 " Kuppuswamy Sathyanarayanan
  2021-05-19  6:04                   ` Kuppuswamy, Sathyanarayanan
@ 2021-05-19 15:31                   ` Dave Hansen
  2021-05-19 19:09                     ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
  2021-05-19 19:13                     ` [RFC v2-fix-v1 " Kuppuswamy, Sathyanarayanan
  1 sibling, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-19 15:31 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

On 5/18/21 10:58 PM, Kuppuswamy Sathyanarayanan wrote:
> Guests communicate with VMMs with hypercalls. Historically, these
> are implemented using instructions that are known to cause VMEXITs
> like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
> expose guest state to the host.  This prevents the old hypercall
> mechanisms from working. So to communicate with VMM, TDX
> specification defines a new instruction called "tdcall".
> 
> In TDX based VM, since VMM is an untrusted entity, a intermediary

"In a TDX-based VM..."

> layer (TDX module) exists between host and guest to facilitate the
> secure communication. And "tdcall" instruction  is used by the guest
> to request services from TDX module. And a variant of "tdcall"
> instruction (with specific arguments as defined by GHCI) is used by
> the guest to request services from  VMM via the TDX module.

I'd just say:

	TDX guests communicate with the TDX module and with the VMM
	using a new instruction: TDCALL.

The rest of that is noise.

> Implement common helper functions to communicate with the TDX Module
> and VMM (using TDCALL instruction).
>    
> __tdx_hypercall()    - function can be used to request services from
> 		       the VMM.
> __tdx_module_call()  - function can be used to communicate with the
> 		       TDX Module.

s/function can be used to//

> Also define two additional wrappers, tdx_hypercall() and
> tdx_hypercall_out_r11() to cover common use cases of
> __tdx_hypercall() function. Since each use case of
> __tdx_module_call() is different, we don't need such wrappers for it.
> 
> Implement __tdx_module_call() and __tdx_hypercall() helper functions
> in assembly.
> 
> Rationale behind choosing to use assembly over inline assembly are,
> 
> 1. Since the number of lines of instructions (with comments) in
> __tdx_hypercall() implementation is over 70, using inline assembly
> to implement it will make it hard to read.
>    
> 2. Also, since many registers (R8-R15, R[A-D]X)) will be used in
> TDCALL operation, if all these registers are included in in-line
> assembly constraints, some of the older compilers may not
> be able to meet this requirement.

Was this "older compiler" argument really the reason?

> Also, just like syscalls, not all TDVMCALL/TDCALLs use cases need to
> use the same set of argument registers. The implementation here picks
> the current worst-case scenario for TDCALL (4 registers). For TDCALLs
> with fewer than 4 arguments, there will end up being a few superfluous
> (cheap) instructions.  But, this approach maximizes code reuse. The
> same argument applies to __tdx_hypercall() function as well.
> 
> Current implementation of __tdx_hypercall() includes error handling
> (ud2 on failure case) in assembly function instead of doing it in C
> wrapper function. The reason behind this choice is, when adding support
> for in/out instructions (refer to patch titled "x86/tdx: Handle port
> I/O" in this series), we use alternative_io() to substitute in/out
> instruction with  __tdx_hypercall() calls. So use of C wrappers is not
> trivial in this case because the input parameters will be in the wrong
> registers and it's tricky to include proper buffer code to make this
> happen.
> 
> For registers used by TDCALL instruction, please check TDX GHCI
> specification, sec 2.4 and 3.
> 
> https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
> 
> Originally-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

For what it's worth, that changelog really starts to ramble after the
"rationale" part.

> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 69af72d08d3d..211b9d66b1b1 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -8,12 +8,50 @@
>  #ifdef CONFIG_INTEL_TDX_GUEST
>  
>  #include <asm/cpufeature.h>
> +#include <linux/types.h>
> +
> +/*
> + * Used in __tdx_module_call() helper function to gather the
> + * output registers values of TDCALL instruction when requesting

There's something wrong in this sentence.  This needs to be "output
register values" or "output registers' values".

> + * services from the TDX module. This is software only structure
> + * and not related to TDX module/VMM.
> + */
> +struct tdx_module_output {
> +	u64 rcx;
> +	u64 rdx;
> +	u64 r8;
> +	u64 r9;
> +	u64 r10;
> +	u64 r11;
> +};
> +
> +/*
> + * Used in __tdx_hypercall() helper function to gather the
> + * output registers values of TDCALL instruction when requesting
> + * services from the VMM. This is software only structure
> + * and not related to TDX module/VMM.
> + */
> +struct tdx_hypercall_output {
> +	u64 r11;
> +	u64 r12;
> +	u64 r13;
> +	u64 r14;
> +	u64 r15;
> +};
>  
>  /* Common API to check TDX support in decompression and common kernel code. */
>  bool is_tdx_guest(void);
>  
>  void __init tdx_early_init(void);
>  
> +/* Helper function used to communicate with the TDX module */
> +u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> +		      struct tdx_module_output *out);
> +
> +/* Helper function used to request services from VMM */
> +u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
> +		    struct tdx_hypercall_output *out);
> +
>  #else // !CONFIG_INTEL_TDX_GUEST
>  
>  static inline bool is_tdx_guest(void)
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index ea111bf50691..7966c10ea8d1 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
>  obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
>  
>  obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
> -obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
> +obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
>  
>  obj-$(CONFIG_EISA)		+= eisa.o
>  obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> index 60b9f42ce3c1..e6b3bb983992 100644
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -23,6 +23,10 @@
>  #include <xen/interface/xen.h>
>  #endif
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +#include <asm/tdx.h>
> +#endif
> +
>  #ifdef CONFIG_X86_32
>  # include "asm-offsets_32.c"
>  #else
> @@ -75,6 +79,24 @@ static void __used common(void)
>  	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
>  #endif
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +	BLANK();
> +	/* Offset for fields in tdcall_output */
> +	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
> +	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
> +	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
> +	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
> +	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
> +	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
> +
> +	/* Offset for fields in tdvmcall_output */
> +	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
> +	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
> +	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
> +	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
> +	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
> +#endif
> +
>  	BLANK();
>  	OFFSET(BP_scratch, boot_params, scratch);
>  	OFFSET(BP_secure_boot, boot_params, secure_boot);
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> new file mode 100644
> index 000000000000..a67c595e4169
> --- /dev/null
> +++ b/arch/x86/kernel/tdcall.S
> @@ -0,0 +1,222 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <asm/asm-offsets.h>
> +#include <asm/asm.h>
> +#include <asm/frame.h>
> +#include <asm/unwind_hints.h>
> +
> +#include <linux/linkage.h>
> +#include <linux/bits.h>
> +
> +#define TDG_R10		BIT(10)
> +#define TDG_R11		BIT(11)
> +#define TDG_R12		BIT(12)
> +#define TDG_R13		BIT(13)
> +#define TDG_R14		BIT(14)
> +#define TDG_R15		BIT(15)
> +
> +/*
> + * Expose registers R10-R15 to VMM. It is passed via RCX register
> + * to the TDX Module, which will be used by the TDX module to
> + * identify the list of registers exposed to VMM. Each bit in this
> + * mask represents a register ID. You can find the bit field details
> + * in TDX GHCI specification.
> + */
> +#define TDVMCALL_EXPOSE_REGS_MASK	( TDG_R10 | TDG_R11 | \
> +					  TDG_R12 | TDG_R13 | \
> +					  TDG_R14 | TDG_R15 )
> +
> +/*
> + * TDX guests use the TDCALL instruction to make requests to the
> + * TDX module and hypercalls to the VMM. It is supported in
> + * Binutils >= 2.36.
> + */
> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
> +
> +/*
> + * __tdx_module_call()  - Helper function used by TDX guests to request
> + * services from the TDX module (does not include VMM services).
> + *
> + * This function serves as a wrapper to move user call arguments to the
> + * correct registers as specified by "tdcall" ABI and shares it with the
> + * TDX module.  And if the "tdcall" operation is successful and a valid

It's frequently taught to never start a sentence with "And" in formal
writing.  You use it fairly frequently.  Simply removing it increases
readability, IMNHO.

> + * "struct tdx_module_output" pointer is available (in "out" argument),
> + * output from the TDX module is saved to the memory specified in the
> + * "out" pointer. Also the status of the "tdcall" operation is returned
> + * back to the user as a function return value.
> + *
> + * @fn  (RDI)		- TDCALL Leaf ID,    moved to RAX
> + * @rcx (RSI)		- Input parameter 1, moved to RCX
> + * @rdx (RDX)		- Input parameter 2, moved to RDX
> + * @r8  (RCX)		- Input parameter 3, moved to R8
> + * @r9  (R8)		- Input parameter 4, moved to R9
> + *
> + * @out (R9)		- struct tdx_module_output pointer
> + *			  stored temporarily in R12 (not
> + * 			  shared with the TDX module)
> + *
> + * Return status of tdcall via RAX.
> + *
> + * NOTE: This function should not be used for TDX hypercall
> + *       use cases.
> + */
> +SYM_FUNC_START(__tdx_module_call)
> +	FRAME_BEGIN
> +
> +	/*
> +	 * R12 will be used as temporary storage for
> +	 * struct tdx_module_output pointer. You can
> +	 * find struct tdx_module_output details in
> +	 * arch/x86/include/asm/tdx.h. Also note that
> +	 * registers R12-R15 are not used by TDCALL
> +	 * services supported by this helper function.
> +	 */
> +	push %r12	/* Callee saved, so preserve it */
> +	mov %r9,  %r12 	/* Move output pointer to R12 */
> +
> +	/* Mangle function call ABI into TDCALL ABI: */
> +	mov %rdi, %rax	/* Move TDCALL Leaf ID to RAX */
> +	mov %r8,  %r9	/* Move input 4 to R9 */
> +	mov %rcx, %r8	/* Move input 3 to R8 */
> +	mov %rsi, %rcx	/* Move input 1 to RCX */
> +	/* Leave input param 2 in RDX */
> +
> +	tdcall
> +
> +	/* Check for TDCALL success: 0 - Successful, otherwise failed */
> +	test %rax, %rax
> +	jnz 1f
> +
> +	/* Check for TDCALL output struct != NULL */
> +	test %r12, %r12
> +	jz 1f
> +
> +	/* Copy TDCALL result registers to output struct: */
> +	movq %rcx, TDX_MODULE_rcx(%r12)
> +	movq %rdx, TDX_MODULE_rdx(%r12)
> +	movq %r8,  TDX_MODULE_r8(%r12)
> +	movq %r9,  TDX_MODULE_r9(%r12)
> +	movq %r10, TDX_MODULE_r10(%r12)
> +	movq %r11, TDX_MODULE_r11(%r12)
> +1:
> +	pop %r12 /* Restore the state of R12 register */
> +
> +	FRAME_END
> +	ret
> +SYM_FUNC_END(__tdx_module_call)
> +
> +/*
> + * do_tdx_hypercall()  - Helper function used by TDX guests to request
> + * services from the VMM. All requests are made via the TDX module
> + * using "TDCALL" instruction.
> + *
> + * This function is created to contain common between vendor specific

This sentence seems wrong.  Common... what?

> + * and standard type tdx hypercalls. So the caller of this function had

Please capitalize "tdx" consistently.

> + * to set the TDVMCALL type in the R10 register before calling it.

> + * This function serves as a wrapper to move user call arguments to the
> + * correct registers as specified by "tdcall" ABI and shares it with VMM
> + * via the TDX module. And if the "tdcall" operation is successful and a
> + * valid "struct tdx_hypercall_output" pointer is available (in "out"
> + * argument), output from the VMM is saved to the memory specified in the
> + * "out" pointer. 
> + *
> + * @fn  (RDI)		- TDVMCALL function, moved to R11
> + * @r12 (RSI)		- Input parameter 1, moved to R12
> + * @r13 (RDX)		- Input parameter 2, moved to R13
> + * @r14 (RCX)		- Input parameter 3, moved to R14
> + * @r15 (R8)		- Input parameter 4, moved to R15
> + *
> + * @out (R9)		- struct tdx_hypercall_output pointer
> + *
> + * On successful completion, return TDX hypercall error code.
> + * If the "tdcall" operation fails, panic.
> + *
> + */

This sounds scary.  Can you try to differentiate a hypercall failure
from a "tdcall" failure?

Actually, I think that's done OK below.  Just remove this mention of
panic().

> +SYM_FUNC_START_LOCAL(do_tdx_hypercall)
> +	/* Save non-volatile GPRs that are exposed to the VMM. */
> +	push %r15
> +	push %r14
> +	push %r13
> +	push %r12
> +
> +	/* Leave hypercall output pointer in R9, it's not clobbered by VMM */
> +
> +	/* Mangle function call ABI into TDCALL ABI: */
> +	xor %eax, %eax /* Move TDCALL leaf ID (TDVMCALL (0)) to RAX */
> +	mov %rdi, %r11 /* Move TDVMCALL function id to R11 */
> +	mov %rsi, %r12 /* Move input 1 to R12 */
> +	mov %rdx, %r13 /* Move input 2 to R13 */
> +	mov %rcx, %r14 /* Move input 3 to R14 */
> +	mov %r8,  %r15 /* Move input 4 to R15 */
> +	/* Caller of do_tdx_hypercall() will set TDVMCALL type in R10 */
> +
> +	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
> +
> +	tdcall
> +
> +	/*
> +	 * Check for TDCALL success: 0 - Successful, otherwise failed.
> +	 * If failed, there is an issue with TDX Module which is fatal
> +	 * for the guest. So panic. Also note that RAX is controlled
> +	 * only by the TDX module and not exposed to VMM.
> +	 */

I'd probably just say:

	/*
	 * Non-zero RAX values indicate a failure of TDCALL itself.
	 * Panic for those.  This value is unrelated to the hypercall
	 * result in R10.
	 */

> +	test %rax, %rax
> +	jnz 2f
> +
> +	/* Move hypercall error code to RAX to return to user */
> +	mov %r10, %rax
> +
> +	/* Check for hypercall success: 0 - Successful, otherwise failed */
> +	test %rax, %rax
> +	jnz 1f
> +
> +	/* Check for hypercall output struct != NULL */

This is a great example of a comment that's not using its space wisely.
If you're reading this, you *KNOW* that it's checking for NULL.  But
what does that *MEAN*?

Why not:

	/* Check if caller provided an output struct */

> +	test %r9, %r9
> +	jz 1f
> +
> +	/* Copy hypercall result registers to output struct: */
> +	movq %r11, TDX_HYPERCALL_r11(%r9)
> +	movq %r12, TDX_HYPERCALL_r12(%r9)
> +	movq %r13, TDX_HYPERCALL_r13(%r9)
> +	movq %r14, TDX_HYPERCALL_r14(%r9)
> +	movq %r15, TDX_HYPERCALL_r15(%r9)
> +1:
> +	/*
> +	 * Zero out registers exposed to the VMM to avoid
> +	 * speculative execution with VMM-controlled values.
> +	 */

You can even say:

	This needs to include all registers present in
	TDVMCALL_EXPOSE_REGS_MASK

> +	xor %r10d, %r10d
> +	xor %r11d, %r11d
> +	xor %r12d, %r12d
> +	xor %r13d, %r13d
> +	xor %r14d, %r14d
> +	xor %r15d, %r15d
> +
> +	/* Restore non-volatile GPRs that are exposed to the VMM. */
> +	pop %r12
> +	pop %r13
> +	pop %r14
> +	pop %r15
> +
> +	ret
> +2:
> +	ud2
> +SYM_FUNC_END(do_tdx_hypercall)
> +
> +/*
> + * Helper function for for standard type of TDVMCALLs. This assembly
> + * wrapper lets us reuse do_tdvmcall() for standard type of hypercalls
> + * (R10 is set as zero).
> + */

Remember, no "us", "we" in changelogs or comments.

> +SYM_FUNC_START(__tdx_hypercall)
> +	FRAME_BEGIN
> +	/*
> +	 * R10 is not part of the function call ABI, but it is a part
> +	 * of the TDVMCALL ABI. So set it 0 for standard type TDVMCALL
> +	 * before making call to the do_tdx_hypercall().
> +	 */
> +	xor %r10, %r10
> +	call do_tdx_hypercall
> +	FRAME_END
> +	retq
> +SYM_FUNC_END(__tdx_hypercall)

The rest of it is fine.  Probably just one more rev to beef up the
comments and changelogs.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 29/32] x86/tdx: Add helper to do MapGPA TDVMCALL
  2021-04-26 18:01 ` [RFC v2 29/32] x86/tdx: Add helper to do MapGPA TDVMCALL Kuppuswamy Sathyanarayanan
@ 2021-05-19 15:59   ` Dave Hansen
  2021-05-20 23:14     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-19 15:59 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> MapGPA TDVMCALL requests the host VMM to map a GPA range as private or
> shared memory mappings. Shared GPA mappings can be used for
> communication between TD guest and host VMM, for example for
> paravirtualized IO.

As usual, I hate the changelog.  This appears to just be regurgitating
the spec.

Is this just for part of converting an existing mapping between private
and shared?  If so, please say that.

> The new helper tdx_map_gpa() provides access to the operation.

<sigh>  You got your own name wrong. It's tdg_map_gpa() in the patch.

BTW, I agree with Sean on this one: "tdg" is a horrible prefix.  You
just proved Sean's point by mistyping it.  *EVERYONE* is going to repeat
that mistake: tdg -> tdx.

> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index dc80cf7f7d08..4789798d7737 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -7,6 +7,11 @@
>  
>  #ifndef __ASSEMBLY__
>  
> +enum tdx_map_type {
> +	TDX_MAP_PRIVATE,
> +	TDX_MAP_SHARED,
> +};

I like the enum, but please call out that this is a software construct,
not a part of any hardware or VMM ABI.

>  #ifdef CONFIG_INTEL_TDX_GUEST
>  
>  #include <asm/cpufeature.h>
> @@ -112,6 +117,8 @@ unsigned short tdg_inw(unsigned short port);
>  unsigned int tdg_inl(unsigned short port);
>  
>  extern phys_addr_t tdg_shared_mask(void);
> +extern int tdg_map_gpa(phys_addr_t gpa, int numpages,
> +		       enum tdx_map_type map_type);
>  
>  #else // !CONFIG_INTEL_TDX_GUEST
>  
> @@ -155,6 +162,12 @@ static inline phys_addr_t tdg_shared_mask(void)
>  {
>  	return 0;
>  }
> +
> +static inline int tdg_map_gpa(phys_addr_t gpa, int numpages,
> +			      enum tdx_map_type map_type)
> +{
> +	return -ENODEV;
> +}

FWIW, you could probably get away with just inlining tdg_map_gpa():

static inline int tdg_map_gpa(phys_addr_t gpa, int numpages, ...
{
	u64 ret;

	if (!IS_ENABLED(CONFIG_INTEL_TDX_GUEST))
		return -ENODEV;

	if (map_type == TDX_MAP_SHARED)
		gpa |= tdg_shared_mask();

	ret = tdvmcall(TDVMCALL_MAP_GPA, gpa, ...

	return ret ? -EIO : 0;
}

Then you don't have three copies of the function signature that can get
out of sync.

>  #endif /* CONFIG_INTEL_TDX_GUEST */
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 7e391cd7aa2b..074136473011 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -15,6 +15,8 @@
>  #include "tdx-kvm.c"
>  #endif
>  
> +#define TDVMCALL_MAP_GPA	0x10001
> +
>  static struct {
>  	unsigned int gpa_width;
>  	unsigned long attributes;
> @@ -98,6 +100,17 @@ static void tdg_get_info(void)
>  	physical_mask &= ~tdg_shared_mask();
>  }
>  
> +int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
> +{
> +	u64 ret;
> +
> +	if (map_type == TDX_MAP_SHARED)
> +		gpa |= tdg_shared_mask();
> +
> +	ret = tdvmcall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0);
> +	return ret ? -EIO : 0;
> +}

The naming Intel chose here is nasty.  This doesn't "map" anything.  It
modifies an existing mapping from what I can tell.  We could name it
much better than the spec, perhaps:

	tdx_hcall_gpa_intent()

BTW, all of these hypercalls need a consistent prefix.

It also needs a comment:

	/*
	 * Inform the VMM of the guest's intent for this physical page:
	 * shared with the VMM or private to the guest.  The VMM is
	 * expected to change its mapping of the page in response.
	 *
	 * Note: shared->private conversions require further guest
	 * action to accept the page.
	 */

The intent here is important.  It makes it clear that this function
really only plays a role in the conversion process.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-04-26 18:01 ` [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Kuppuswamy Sathyanarayanan
  2021-05-19  5:00   ` Kuppuswamy, Sathyanarayanan
@ 2021-05-19 16:14   ` Dave Hansen
  2021-05-20 18:48     ` Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-19 16:14 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> tdx_shared_mask() returns the mask that has to be set in a page
> table entry to make page shared with VMM.

Here's a rewrite:

Just like MKTME, TDX reassigns bits of the physical address for
metadata.  MKTME used several bits for an encryption KeyID.  TDX uses a
single bit in guests to communicate whether a physical page should be
protected by TDX as private memory (bit set to 0) or unprotected and
shared with the VMM (bit set to 1).

Add a helper, tdg_shared_mask() (bad name, please fix it) to generate the
mask.  The processor enumerates its physical address width to include
the shared bit, which means it gets included in __PHYSICAL_MASK by default.

Remove the shared mask from 'physical_mask' since any bits in
tdg_shared_mask() are not used for physical addresses in page table entries.

--

BTW, do you find it confusing that the subject says: '__PHYSICAL_MASK'
and yet the code only modifies 'physical_mask'?

> Also, note that we cannot club shared mapping configuration between
> AMD SME and Intel TDX Guest platforms in common function. SME has
> to do it very early in __startup_64() as it sets the bit on all
> memory, except what is used for communication. TDX can postpone as
> we don't need any shared mapping in very early boot.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/Kconfig           | 1 +
>  arch/x86/include/asm/tdx.h | 6 ++++++
>  arch/x86/kernel/tdx.c      | 9 +++++++++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 67f99bf27729..5f92e8205de2 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -882,6 +882,7 @@ config INTEL_TDX_GUEST
>  	select PARAVIRT_XL
>  	select X86_X2APIC
>  	select SECURITY_LOCKDOWN_LSM
> +	select X86_MEM_ENCRYPT_COMMON
>  	help
>  	  Provide support for running in a trusted domain on Intel processors
>  	  equipped with Trusted Domain eXtensions. TDX is a new Intel
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index b972c6531a53..dc80cf7f7d08 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -111,6 +111,8 @@ unsigned char tdg_inb(unsigned short port);
>  unsigned short tdg_inw(unsigned short port);
>  unsigned int tdg_inl(unsigned short port);
>  
> +extern phys_addr_t tdg_shared_mask(void);
> +
>  #else // !CONFIG_INTEL_TDX_GUEST
>  
>  static inline bool is_tdx_guest(void)
> @@ -149,6 +151,10 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
>  	return -ENODEV;
>  }
>  
> +static inline phys_addr_t tdg_shared_mask(void)
> +{
> +	return 0;
> +}
>  #endif /* CONFIG_INTEL_TDX_GUEST */
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 1f1bb98e1d38..7e391cd7aa2b 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -76,6 +76,12 @@ bool is_tdx_guest(void)
>  }
>  EXPORT_SYMBOL_GPL(is_tdx_guest);
>  
> +/* The highest bit of a guest physical address is the "sharing" bit */
> +phys_addr_t tdg_shared_mask(void)
> +{
> +	return 1ULL << (td_info.gpa_width - 1);
> +}

Why not just inline this thing?  Functions don't get any smaller than
that.  Or does it not get used anywhere else?  Or are you concerned
about exporting td_info?
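For reference, the computation under discussion can be sketched in plain C
(the gpa_width values used below are illustrative; at runtime
td_info.gpa_width comes from TDG.VP.INFO):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of tdg_shared_mask(): the shared bit is the highest guest
 * physical address bit, so the mask is a single bit derived from
 * gpa_width.
 */
static uint64_t shared_mask_sketch(unsigned int gpa_width)
{
	return 1ULL << (gpa_width - 1);
}

/*
 * physical_mask then drops that bit so it is never treated as part
 * of the physical address in page table entries.
 */
static uint64_t strip_shared_bit(uint64_t physical_mask,
				 unsigned int gpa_width)
{
	return physical_mask & ~shared_mask_sketch(gpa_width);
}
```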

>  static void tdg_get_info(void)
>  {
>  	u64 ret;
> @@ -87,6 +93,9 @@ static void tdg_get_info(void)
>  
>  	td_info.gpa_width = out.rcx & GENMASK(5, 0);
>  	td_info.attributes = out.rdx;
> +
> +	/* Exclude Shared bit from the __PHYSICAL_MASK */
> +	physical_mask &= ~tdg_shared_mask();
>  }
>  
>  static __cpuidle void tdg_halt(void)
> 


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/boot: Avoid #VE during boot for TDX platforms
  2021-05-18  0:59     ` [RFC v2-fix 1/1] x86/boot: Avoid #VE during boot for TDX platforms Kuppuswamy Sathyanarayanan
@ 2021-05-19 16:53       ` Dave Hansen
  2021-05-21 14:35         ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-19 16:53 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson

On 5/17/21 5:59 PM, Kuppuswamy Sathyanarayanan wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Avoid operations which will inject #VE during boot process,
> which is obviously fatal for TDX platforms.

It's not "obviously fatal".  We actually have early exception handlers.
 Please give an actual reason.  "They're easy to avoid, and that sure
beats handling the exceptions" is a perfectly fine reason.

> Details are,
> 
> 1. TDX module injects #VE if a TDX guest attempts to write
>    EFER.
>    
>    Boot code updates EFER in following cases:
>    
>    * When enabling Long Mode configuration, EFER.LME bit will
>      be set. Since TDX forces EFER.LME=1, we can skip updating
>      it again. Check for EFER.LME before updating it and skip
>      it if it is already set.
> 
>    * EFER is also updated to enable support for features like
>      System call and No Execute page setting. In TDX, these
>      features are set up by the TDX module. So check whether
>      it is already enabled, and skip enabling it again.
>    
> 2. TDX module also injects a #VE if the guest attempts to clear
>    CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed
>    boot. Setting CR0.NE should be a nop on all CPUs that
>    support 64-bit mode.
>    
> 3. The TDX-Module (effectively part of the hypervisor) requires

So, after we've mentioned the TDX module a few times, *NOW* we feel the
need to explain what it is?  I'm also baffled by this little aside.
Literally the WHOLE POINT FOR SEAM TO EXIST is that it is NOT PART OF
THE HYPERVISOR.  The whole point.  Literally.

>    CR4.MCE to be set at all times and injects a #VE if the guest
>    attempts to clear CR4.MCE. So, preserve CR4.MCE instead of
>    clearing it during boot to avoid #VE.

This is a good example of a changelog run amok.  It doesn't need to be
an English language reproduction of the code.  This is getting close.

This can all be replaced and improved with a high-level discussion of
what is going on:

	There are a few MSRs and control register bits which the kernel
	normally needs to modify during boot.  But, TDX disallows
	modification of these registers to help provide consistent
	security guarantees.  Fortunately, TDX ensures that these are
	all in the correct state before the kernel loads, which means
	the kernel has no need to modify them.

	The conditions we need to avoid are:
	1. Any writes to the EFER MSR
	2. Clearing CR0.NE
	3. Clearing CR4.MCE
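As a sanity check, the CR4 handling being reviewed here boils down to simple
mask arithmetic; a plain-C sketch (using the standard x86 bit values from the
arch/x86 headers):

```c
#include <assert.h>
#include <stdint.h>

/* Standard x86 CR4 bit values (as defined in arch/x86 headers) */
#define X86_CR4_PAE	0x00000020UL	/* Physical Address Extension */
#define X86_CR4_MCE	0x00000040UL	/* Machine Check Exception */

/*
 * Sketch of the trampoline's CR4 update: preserve only MCE from the
 * current value (clearing it would #VE in a TDX guest) and OR in the
 * bits the boot code needs.
 */
static uint64_t trampoline_cr4(uint64_t cr4)
{
	return (cr4 & X86_CR4_MCE) | X86_CR4_PAE;
}
```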

> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index e94874f4bbc1..2d79e5f97360 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -616,12 +616,16 @@ SYM_CODE_START(trampoline_32bit_src)
>  	movl	$MSR_EFER, %ecx
>  	rdmsr
>  	btsl	$_EFER_LME, %eax
> +	jc	1f
>  	wrmsr
> -	popl	%edx
> +1:	popl	%edx

A comment would be nice:

	/* Avoid writing EFER if no change was made (for TDX guest) */

>  	popl	%ecx
>  
>  	/* Enable PAE and LA57 (if required) paging modes */
> -	movl	$X86_CR4_PAE, %eax
> +	movl	%cr4, %eax
> +	/* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
> +	andl	$X86_CR4_MCE, %eax

Maybe I'm just dense today, but I was boggling about what this 'andl' is
actually doing.  This would help:

	/*
	 * Clear all bits except CR4.MCE, which is preserved.
	 * Clearing CR4.MCE will #VE in TDX guests.
	 */

> +	orl	$X86_CR4_PAE, %eax
>  	testl	%edx, %edx
>  	jz	1f
>  	orl	$X86_CR4_LA57, %eax
> @@ -636,7 +640,7 @@ SYM_CODE_START(trampoline_32bit_src)
>  	pushl	%eax
>  
>  	/* Enable paging again */
> -	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
> +	movl	$(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
>  	movl	%eax, %cr0

Shouldn't we also comment the X86_CR0_NE?

	/* Enable paging again.  Avoid clearing X86_CR0_NE for TDX. */

>  	lret
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 04bddaaba8e2..92c77cf75542 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
>  1:
>  
>  	/* Enable PAE mode, PGE and LA57 */
> -	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
> +	movq	%cr4, %rcx
> +	/* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
> +	andl	$X86_CR4_MCE, %ecx
> +	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx

Ditto on the comment from above about clearing/preserving bits.

>  #ifdef CONFIG_X86_5LEVEL
>  	testl	$1, __pgtable_l5_enabled(%rip)
>  	jz	1f
> @@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
>  	/* Setup EFER (Extended Feature Enable Register) */
>  	movl	$MSR_EFER, %ecx
>  	rdmsr
> +	movl    %eax, %edx

Comment, please.

>  	btsl	$_EFER_SCE, %eax	/* Enable System Call */
>  	btl	$20,%edi		/* No Execute supported? */
>  	jnc     1f
>  	btsl	$_EFER_NX, %eax
>  	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
> -1:	wrmsr				/* Make changes effective */
>  
> +	/* Skip the WRMSR if the current value matches the desired value. */

If I read this comment in 5 years, I'm going to ask "Why bother?".
Please mention TDX.

> +1:	cmpl	%edx, %eax
> +	je	1f
> +	xor	%edx, %edx
> +	wrmsr				/* Make changes effective */
> +1:
>  	/* Setup cr0 */
>  	movl	$CR0_STATE, %eax
>  	/* Make changes effective */
> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
> index 754f8d2ac9e8..12b734b1da8b 100644
> --- a/arch/x86/realmode/rm/trampoline_64.S
> +++ b/arch/x86/realmode/rm/trampoline_64.S
> @@ -143,13 +143,20 @@ SYM_CODE_START(startup_32)
>  	movl	%eax, %cr3
>  
>  	# Set up EFER
> +	movl	$MSR_EFER, %ecx
> +	rdmsr
> +	cmp	pa_tr_efer, %eax
> +	jne	.Lwrite_efer
> +	cmp	pa_tr_efer + 4, %edx

Comment, please:

	# Skip EFER writes to avoid faults in TDX guests

> +	je	.Ldone_efer
> +.Lwrite_efer:
>  	movl	pa_tr_efer, %eax
>  	movl	pa_tr_efer + 4, %edx
> -	movl	$MSR_EFER, %ecx
>  	wrmsr
>  
> +.Ldone_efer:
>  	# Enable paging and in turn activate Long Mode
> -	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
> +	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
>  	movl	%eax, %cr0
>  
>  	/*
> 



* [RFC v2-fix-v2 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-05-19 15:31                   ` Dave Hansen
@ 2021-05-19 19:09                     ` Kuppuswamy Sathyanarayanan
  2021-05-19 19:13                     ` [RFC v2-fix-v1 " Kuppuswamy, Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-19 19:09 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kuppuswamy Sathyanarayanan

Guests communicate with VMMs via hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs,
like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
expose guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called "tdcall".

In a TDX-based VM, since the VMM is an untrusted entity, an
intermediary layer (the TDX module) exists between host and guest to
facilitate secure communication. TDX guests communicate with the TDX
module and with the VMM using this new instruction: TDCALL.

Implement common helper functions to communicate with the TDX Module
and VMM (using TDCALL instruction).
   
__tdx_hypercall()    - request services from the VMM.
__tdx_module_call()  - communicate with the TDX Module.

Also define two additional wrappers, tdx_hypercall() and
tdx_hypercall_out_r11() to cover common use cases of
__tdx_hypercall() function. Since each use case of
__tdx_module_call() is different, we don't need such wrappers for it.

Implement __tdx_module_call() and __tdx_hypercall() helper functions
in assembly.

The rationale for choosing assembly over inline assembly is that the
__tdx_hypercall() implementation is over 70 lines of instructions
(with comments), and an inline assembly version of that size would
be hard to read.
   
Also, just like syscalls, not all TDVMCALL/TDCALLs use cases need to
use the same set of argument registers. The implementation here picks
the current worst-case scenario for TDCALL (4 registers). For TDCALLs
with fewer than 4 arguments, there will end up being a few superfluous
(cheap) instructions.  But, this approach maximizes code reuse. The
same argument applies to __tdx_hypercall() function as well.

The current implementation of __tdx_hypercall() includes error
handling (ud2 on the failure case) in the assembly function instead
of in a C wrapper. The reason for this choice is that, when adding
support for in/out instructions (see the patch titled "x86/tdx:
Handle port I/O" in this series), alternative_io() is used to
substitute in/out instructions with __tdx_hypercall() calls. Using C
wrappers there is not trivial, because the input parameters would end
up in the wrong registers and it is tricky to include the proper
buffer code to make this work.

For registers used by TDCALL instruction, please check TDX GHCI
specification, sec 2.4 and 3.

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Originally-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix-v1:
 * Fixed commit log and comment corrections as suggested by Dave.

 arch/x86/include/asm/tdx.h    |  38 ++++++
 arch/x86/kernel/Makefile      |   2 +-
 arch/x86/kernel/asm-offsets.c |  22 ++++
 arch/x86/kernel/tdcall.S      | 223 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c         |  39 ++++++
 5 files changed, 323 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..fcd42119a287 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,50 @@
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
+#include <linux/types.h>
+
+/*
+ * Used in __tdx_module_call() helper function to gather the
+ * output registers' values of TDCALL instruction when requesting
+ * services from the TDX module. This is a software-only structure
+ * and is not part of the TDX module/VMM ABI.
+ */
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+/*
+ * Used in __tdx_hypercall() helper function to gather the
+ * output registers' values of TDCALL instruction when requesting
+ * services from the VMM. This is a software-only structure
+ * and is not part of the TDX module/VMM ABI.
+ */
+struct tdx_hypercall_output {
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
 void __init tdx_early_init(void);
 
+/* Helper function used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		      struct tdx_module_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+		    struct tdx_hypercall_output *out);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..e6b3bb983992 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
 #include <xen/interface/xen.h>
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
 #ifdef CONFIG_X86_32
 # include "asm-offsets_32.c"
 #else
@@ -75,6 +79,24 @@ static void __used common(void)
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+	BLANK();
+	/* Offset for fields in tdcall_output */
+	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
+	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
+	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+	/* Offset for fields in tdvmcall_output */
+	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
+	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
+	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
+	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
+	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#endif
+
 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
 	OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..b06e8b62dfe2
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,223 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+
+#define TDG_R10		BIT(10)
+#define TDG_R11		BIT(11)
+#define TDG_R12		BIT(12)
+#define TDG_R13		BIT(13)
+#define TDG_R14		BIT(14)
+#define TDG_R15		BIT(15)
+
+/*
+ * Bitmask of registers R10-R15 exposed to the VMM. It is passed to
+ * the TDX module via the RCX register, and the TDX module uses it
+ * to identify which registers are exposed to the VMM. Each bit in
+ * this mask represents a register ID. You can find the bit field
+ * details in the TDX GHCI specification.
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK	( TDG_R10 | TDG_R11 | \
+					  TDG_R12 | TDG_R13 | \
+					  TDG_R14 | TDG_R15 )
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdx_module_call()  - Helper function used by TDX guests to request
+ * services from the TDX module (does not include VMM services).
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with the
+ * TDX module. If the "tdcall" operation is successful and a valid
+ * "struct tdx_module_output" pointer is available (in "out" argument),
+ * output from the TDX module is saved to the memory specified in the
+ * "out" pointer. Also the status of the "tdcall" operation is returned
+ * back to the user as a function return value.
+ *
+ * @fn  (RDI)		- TDCALL Leaf ID,    moved to RAX
+ * @rcx (RSI)		- Input parameter 1, moved to RCX
+ * @rdx (RDX)		- Input parameter 2, moved to RDX
+ * @r8  (RCX)		- Input parameter 3, moved to R8
+ * @r9  (R8)		- Input parameter 4, moved to R9
+ *
+ * @out (R9)		- struct tdx_module_output pointer
+ *			  stored temporarily in R12 (not
+ * 			  shared with the TDX module)
+ *
+ * Return status of tdcall via RAX.
+ *
+ * NOTE: This function should not be used for TDX hypercall
+ *       use cases.
+ */
+SYM_FUNC_START(__tdx_module_call)
+	FRAME_BEGIN
+
+	/*
+	 * R12 will be used as temporary storage for
+	 * struct tdx_module_output pointer. You can
+	 * find struct tdx_module_output details in
+	 * arch/x86/include/asm/tdx.h. Also note that
+	 * registers R12-R15 are not used by TDCALL
+	 * services supported by this helper function.
+	 */
+	push %r12	/* Callee saved, so preserve it */
+	mov %r9,  %r12 	/* Move output pointer to R12 */
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	mov %rdi, %rax	/* Move TDCALL Leaf ID to RAX */
+	mov %r8,  %r9	/* Move input 4 to R9 */
+	mov %rcx, %r8	/* Move input 3 to R8 */
+	mov %rsi, %rcx	/* Move input 1 to RCX */
+	/* Leave input param 2 in RDX */
+
+	tdcall
+
+	/* Check for TDCALL success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check if caller provided an output struct */
+	test %r12, %r12
+	jz 1f
+
+	/* Copy TDCALL result registers to output struct: */
+	movq %rcx, TDX_MODULE_rcx(%r12)
+	movq %rdx, TDX_MODULE_rdx(%r12)
+	movq %r8,  TDX_MODULE_r8(%r12)
+	movq %r9,  TDX_MODULE_r9(%r12)
+	movq %r10, TDX_MODULE_r10(%r12)
+	movq %r11, TDX_MODULE_r11(%r12)
+1:
+	pop %r12 /* Restore the state of R12 register */
+
+	FRAME_END
+	ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * do_tdx_hypercall()  - Helper function used by TDX guests to request
+ * services from the VMM. All requests are made via the TDX module
+ * using "TDCALL" instruction.
+ *
+ * This function is created to contain the code common to vendor-
+ * specific and standard type TDX hypercalls. The caller of this
+ * function has to set the TDVMCALL type in the R10 register before
+ * calling it.
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with VMM
+ * via the TDX module. If the "tdcall" operation is successful and a
+ * valid "struct tdx_hypercall_output" pointer is available (in "out"
+ * argument), output from the VMM is saved to the memory specified in the
+ * "out" pointer. 
+ *
+ * @fn  (RDI)		- TDVMCALL function, moved to R11
+ * @r12 (RSI)		- Input parameter 1, moved to R12
+ * @r13 (RDX)		- Input parameter 2, moved to R13
+ * @r14 (RCX)		- Input parameter 3, moved to R14
+ * @r15 (R8)		- Input parameter 4, moved to R15
+ *
+ * @out (R9)		- struct tdx_hypercall_output pointer
+ *
+ * On successful completion, return TDX hypercall error code.
+ *
+ */
+SYM_FUNC_START_LOCAL(do_tdx_hypercall)
+	/* Save non-volatile GPRs that are exposed to the VMM. */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/* Leave hypercall output pointer in R9, it's not clobbered by VMM */
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	xor %eax, %eax /* Move TDCALL leaf ID (TDVMCALL (0)) to RAX */
+	mov %rdi, %r11 /* Move TDVMCALL function id to R11 */
+	mov %rsi, %r12 /* Move input 1 to R12 */
+	mov %rdx, %r13 /* Move input 2 to R13 */
+	mov %rcx, %r14 /* Move input 3 to R14 */
+	mov %r8,  %r15 /* Move input 4 to R15 */
+	/* Caller of do_tdx_hypercall() will set TDVMCALL type in R10 */
+
+	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+	tdcall
+
+	/*
+	 * Non-zero RAX values indicate a failure of TDCALL itself.
+	 * Panic for those.  This value is unrelated to the hypercall
+	 * result in R10.
+	 */
+	test %rax, %rax
+	jnz 2f
+
+	/* Move hypercall error code to RAX to return to user */
+	mov %r10, %rax
+
+	/* Check for hypercall success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check if caller provided an output struct */
+	test %r9, %r9
+	jz 1f
+
+	/* Copy hypercall result registers to output struct: */
+	movq %r11, TDX_HYPERCALL_r11(%r9)
+	movq %r12, TDX_HYPERCALL_r12(%r9)
+	movq %r13, TDX_HYPERCALL_r13(%r9)
+	movq %r14, TDX_HYPERCALL_r14(%r9)
+	movq %r15, TDX_HYPERCALL_r15(%r9)
+1:
+	/*
+	 * Zero out registers exposed to the VMM to avoid
+	 * speculative execution with VMM-controlled values.
+	 * This needs to include all registers present in
+	 * TDVMCALL_EXPOSE_REGS_MASK.
+	 */
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+	xor %r12d, %r12d
+	xor %r13d, %r13d
+	xor %r14d, %r14d
+	xor %r15d, %r15d
+
+	/* Restore non-volatile GPRs that are exposed to the VMM. */
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	ret
+2:
+	ud2
+SYM_FUNC_END(do_tdx_hypercall)
+
+/*
+ * Helper function for standard TDVMCALLs. This assembly
+ * wrapper reuses do_tdx_hypercall() for standard hypercalls
+ * (R10 is set to zero).
+ */
+SYM_FUNC_START(__tdx_hypercall)
+	FRAME_BEGIN
+	/*
+	 * R10 is not part of the function call ABI, but it is a part
+	 * of the TDVMCALL ABI. So set it to 0 for standard TDVMCALLs
+	 * before calling do_tdx_hypercall().
+	 */
+	xor %r10, %r10
+	call do_tdx_hypercall
+	FRAME_END
+	retq
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 6a7193fead08..cbfefc42641e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -1,8 +1,47 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (C) 2020 Intel Corporation */
 
+#define pr_fmt(fmt) "TDX: " fmt
+
 #include <asm/tdx.h>
 
+/*
+ * Wrapper for simple hypercalls that only return a success/error code.
+ */
+static inline u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	u64 err;
+
+	err = __tdx_hypercall(fn, r12, r13, r14, r15, NULL);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return err;
+}
+
+/*
+ * Wrapper for the semi-common case where we need a single output
+ * value (R11). Callers of this function do not care about the
+ * hypercall error code (mainly for the IN or MMIO use case).
+ */
+static inline u64 tdx_hypercall_out_r11(u64 fn, u64 r12, u64 r13,
+					u64 r14, u64 r15)
+{
+
+	struct tdx_hypercall_output out = {0};
+	u64 err;
+
+	err = __tdx_hypercall(fn, r12, r13, r14, r15, &out);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return out.r11;
+}
+
 static inline bool cpuid_has_tdx_guest(void)
 {
 	u32 eax, signature[3];
-- 
2.25.1



* Re: [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-05-19 15:31                   ` Dave Hansen
  2021-05-19 19:09                     ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
@ 2021-05-19 19:13                     ` Kuppuswamy, Sathyanarayanan
  2021-05-19 20:09                       ` Sean Christopherson
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-19 19:13 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel



On 5/19/21 8:31 AM, Dave Hansen wrote:
> Was this "older compiler" argument really the reason?

It is a speculation. I haven't tried to reproduce it with old compiler. So
I have removed that point.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-05-19 19:13                     ` [RFC v2-fix-v1 " Kuppuswamy, Sathyanarayanan
@ 2021-05-19 20:09                       ` Sean Christopherson
  2021-05-19 20:49                         ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-19 20:09 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

On Wed, May 19, 2021, Kuppuswamy, Sathyanarayanan wrote:
> 
> On 5/19/21 8:31 AM, Dave Hansen wrote:
> > Was this "older compiler" argument really the reason?
> 
> It is a speculation. I haven't tried to reproduce it with old compiler. So
> I have removed that point.

It's not "older" compilers.  gcc does not support R8-R15 as input/output
constraints, which means inline asm needs to do register shenanigans, and those
are horribly fragile because the compiler does not ensure register variables are
preserved outside of asm blobs.  E.g. adding a print like so can corrupt r10,
which makes it an absolute nightmare to debug/trace flows that pass r8-r15 to
asm blobs since looking at the code the wrong way can break things.

	register unsigned long r10 asm("r10") = __r10;

	pr_info("TDCALL: RAX = %lx, R10 = %lx\n", rax, __r10);

	asm volatile("tdcall"
		     : "=a"(rax)
		     : "a"(rax), "r"(r10));


* Re: [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-05-19 20:09                       ` Sean Christopherson
@ 2021-05-19 20:49                         ` Andi Kleen
  2021-05-27  0:30                           ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-19 20:49 UTC (permalink / raw)
  To: Sean Christopherson, Kuppuswamy, Sathyanarayanan
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, linux-kernel


On 5/19/2021 1:09 PM, Sean Christopherson wrote:
> On Wed, May 19, 2021, Kuppuswamy, Sathyanarayanan wrote:
>> On 5/19/21 8:31 AM, Dave Hansen wrote:
>>> Was this "older compiler" argument really the reason?
>> It is a speculation. I haven't tried to reproduce it with old compiler. So
>> I have removed that point.
> It's not "older" compilers.  gcc does not support R8-R15 as input/output
> constraints,

Yes that's true, but they can be in clobbers. So it usually just needs a 
mov from the input arguments for input, or a mov to the output 
arguments for output.





* Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-05-18  4:08           ` Dan Williams
@ 2021-05-20  0:18             ` Kuppuswamy, Sathyanarayanan
  2021-05-20  0:40               ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-20  0:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Sean Christopherson, Kai Huang

Hi Dan,

On 5/17/21 9:08 PM, Dan Williams wrote:
>> SYM_DATA_START_LOCAL(tr_idt)
>>           .short  0
>>           .quad   0
>> SYM_DATA_END(tr_idt)
> This format implies that tr_idt is reserving space for 2 distinct data
> structure attributes of those sizes, can you just put those names here
> as comments? Otherwise the .fill format is more compact.

Initially it's 6 bytes (2 bytes for the IDT limit, 4 bytes for the
32-bit linear start address). This patch extends it by another 4 bytes
to support 64-bit mode:

2 bytes for the IDT limit (.short)
8 bytes for the 64-bit IDT start address (.quad)

This info is included in the commit log. But I will add a comment here
as you have mentioned.

Will the following comment do?

/* Use 10 bytes for the IDT (in 64-bit mode): 8 bytes for the IDT start
   address, 2 bytes for the IDT limit */

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-05-20  0:18             ` Kuppuswamy, Sathyanarayanan
@ 2021-05-20  0:40               ` Dan Williams
  2021-05-20  0:42                 ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-20  0:40 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Sean Christopherson, Kai Huang

On Wed, May 19, 2021 at 5:19 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> Hi Dan,
>
> On 5/17/21 9:08 PM, Dan Williams wrote:
> >> SYM_DATA_START_LOCAL(tr_idt)
> >>           .short  0
> >>           .quad   0
> >> SYM_DATA_END(tr_idt)
> > This format implies that tr_idt is reserving space for 2 distinct data
> > structure attributes of those sizes, can you just put those names here
> > as comments? Otherwise the .fill format is more compact.
>
> Initially its 6 bytes (2 bytes for IDT limit, 4 bytes for 32 bit linear
> start address). This patch extends it by another 4 bytes for supporting
> 64 bit mode.
>
> 2 bytes IDT limit (.short)
> 8 bytes for 64 bit IDT start address (.quad)
>
> This info is included in commit log. But I will add comment here as you
> have mentioned.

Thanks. I only read commit logs when code comments fail.

>
> Will following comment log do ?
>
> /* Use 10 bytes for IDT (in 64 bit mode), 8 bytes for IDT start address
>     2 bytes for IDT limit size */

I would clarify how the boot code uses this:

"When a bootloader hands off to the kernel in 32-bit mode, an IDT with
a 2-byte limit and a 4-byte base is needed. When a bootloader hands off
to a kernel in 64-bit mode, the base address extends to 8 bytes. Reserve
enough space for either scenario."
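The reserved layout can also be expressed as a packed struct, which makes the
10-byte worst case explicit (the struct and field names here are illustrative,
not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative layout of the trampoline IDT descriptor: a 2-byte
 * limit followed by a base address that is 4 bytes for a 32-bit
 * hand-off and 8 bytes for a 64-bit one. Reserving the 64-bit size
 * (10 bytes total) covers both scenarios.
 */
struct tr_idt_desc {
	uint16_t limit;
	uint64_t base;
} __attribute__((packed));
```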


* Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-05-20  0:40               ` Dan Williams
@ 2021-05-20  0:42                 ` Kuppuswamy, Sathyanarayanan
  2021-05-21 14:39                   ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-20  0:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Sean Christopherson, Kai Huang



On 5/19/21 5:40 PM, Dan Williams wrote:
> I would clarify how the boot code uses this:
> 
> "When a bootloader hands off to the kernel in 32-bit mode, an IDT with
> a 2-byte limit and a 4-byte base is needed. When a bootloader hands off
> to a kernel in 64-bit mode, the base address extends to 8 bytes. Reserve
> enough space for either scenario."

I will add it. Thanks.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-05-19 16:14   ` Dave Hansen
@ 2021-05-20 18:48     ` Kuppuswamy, Sathyanarayanan
  2021-05-20 18:56       ` Kuppuswamy, Sathyanarayanan
                         ` (2 more replies)
  0 siblings, 3 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-20 18:48 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

Hi Dave,

On 5/19/21 9:14 AM, Dave Hansen wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> tdx_shared_mask() returns the mask that has to be set in a page
>> table entry to make page shared with VMM.
> 
> Here's a rewrite:
> 
> Just like MKTME, TDX reassigns bits of the physical address for
> metadata.  MKTME used several bits for an encryption KeyID.  TDX uses a
> single bit in guests to communicate whether a physical page should be
> protected by TDX as private memory (bit set to 0) or unprotected and
> shared with the VMM (bit set to 1).
> 
> Add a helper, tdg_shared_mask() (bad name please fix it) to generate the

Initially we used the tdx_* prefix for the guest code. But when the host-side
code got merged in, we came across many name conflicts. So to avoid such
issues in the future, we were asked not to use the "tdx_" prefix, and our
alternative choice was "tdg_".

Also, IMO, the "tdg" prefix is more meaningful for guest code (Trusted Domain
Guest) than "tdx" (Trusted Domain eXtensions). I know that it gets confusing
when grepping for TDX-related changes. But since these functions are only used
inside arch/x86, it should not be too confusing.

Even if a rename is requested, IMO, it is easier to do it in one patch than to
make changes in all the patches. So if it is required, we can do it later once
these initial patches are merged.

> mask.  The processor enumerates its physical address width to include
> the shared bit, which means it gets included in __PHYSICAL_MASK by default.
> 
> Remove the shared mask from 'physical_mask' since any bits in
> tdg_shared_mask() are not used for physical addresses in page table entries.
> 
> --

Thanks. I will include it in next version.

> 
> BTW, do you find it confusing that the subject says: '__PHYSICAL_MASK'
> and yet the code only modifies 'physical_mask'?
> 
>> Also, note that we cannot club shared mapping configuration between
>> AMD SME and Intel TDX Guest platforms in common function. SME has
>> to do it very early in __startup_64() as it sets the bit on all
>> memory, except what is used for communication. TDX can postpone as
>> we don't need any shared mapping in very early boot.
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> ---
>>   arch/x86/Kconfig           | 1 +
>>   arch/x86/include/asm/tdx.h | 6 ++++++
>>   arch/x86/kernel/tdx.c      | 9 +++++++++
>>   3 files changed, 16 insertions(+)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 67f99bf27729..5f92e8205de2 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -882,6 +882,7 @@ config INTEL_TDX_GUEST
>>   	select PARAVIRT_XL
>>   	select X86_X2APIC
>>   	select SECURITY_LOCKDOWN_LSM
>> +	select X86_MEM_ENCRYPT_COMMON
>>   	help
>>   	  Provide support for running in a trusted domain on Intel processors
>>   	  equipped with Trusted Domain eXtensions. TDX is a new Intel
>> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
>> index b972c6531a53..dc80cf7f7d08 100644
>> --- a/arch/x86/include/asm/tdx.h
>> +++ b/arch/x86/include/asm/tdx.h
>> @@ -111,6 +111,8 @@ unsigned char tdg_inb(unsigned short port);
>>   unsigned short tdg_inw(unsigned short port);
>>   unsigned int tdg_inl(unsigned short port);
>>   
>> +extern phys_addr_t tdg_shared_mask(void);
>> +
>>   #else // !CONFIG_INTEL_TDX_GUEST
>>   
>>   static inline bool is_tdx_guest(void)
>> @@ -149,6 +151,10 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
>>   	return -ENODEV;
>>   }
>>   
>> +static inline phys_addr_t tdg_shared_mask(void)
>> +{
>> +	return 0;
>> +}
>>   #endif /* CONFIG_INTEL_TDX_GUEST */
>>   #endif /* __ASSEMBLY__ */
>>   #endif /* _ASM_X86_TDX_H */
>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>> index 1f1bb98e1d38..7e391cd7aa2b 100644
>> --- a/arch/x86/kernel/tdx.c
>> +++ b/arch/x86/kernel/tdx.c
>> @@ -76,6 +76,12 @@ bool is_tdx_guest(void)
>>   }
>>   EXPORT_SYMBOL_GPL(is_tdx_guest);
>>   
>> +/* The highest bit of a guest physical address is the "sharing" bit */
>> +phys_addr_t tdg_shared_mask(void)
>> +{
>> +	return 1ULL << (td_info.gpa_width - 1);
>> +}
> 
> Why not just inline this thing?  Functions don't get any smaller than
> that.  Or does it not get used anywhere else?  Or are you concerned
> about exporting td_info?

We don't want to export td_info. It contains more information than just the
shared-mask details. Any reason for suggesting inline?

This function is only used in the following files.

arch/x86/include/asm/pgtable.h:25:#define pgprot_tdg_shared(prot) __pgprot(pgprot_val(prot) | 
tdg_shared_mask())
arch/x86/mm/pat/set_memory.c:1997:		mem_plain_bits = __pgprot(tdg_shared_mask());
arch/x86/kernel/tdx.c:134:phys_addr_t tdg_shared_mask(void)
arch/x86/kernel/tdx.c:274:	physical_mask &= ~tdg_shared_mask();


> 
>>   static void tdg_get_info(void)
>>   {
>>   	u64 ret;
>> @@ -87,6 +93,9 @@ static void tdg_get_info(void)
>>   
>>   	td_info.gpa_width = out.rcx & GENMASK(5, 0);
>>   	td_info.attributes = out.rdx;
>> +
>> +	/* Exclude Shared bit from the __PHYSICAL_MASK */
>> +	physical_mask &= ~tdg_shared_mask();
>>   }
>>   
>>   static __cpuidle void tdg_halt(void)
>>
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-05-20 18:48     ` Kuppuswamy, Sathyanarayanan
@ 2021-05-20 18:56       ` Kuppuswamy, Sathyanarayanan
  2021-05-20 19:33       ` Sean Christopherson
  2021-05-20 20:30       ` [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Dave Hansen
  2 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-20 18:56 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 5/20/21 11:48 AM, Kuppuswamy, Sathyanarayanan wrote:
> BTW, do you find it confusing that the subject says: '__PHYSICAL_MASK'
> and yet the code only modifies 'physical_mask'?

"__PHYSICAL_MASK" is defined as "physical_mask" in page_types.h. MM code seems to
use __PHYSICAL_MASK for common usage. But for our use case, if it makes it more
readable, I am fine with using "physical_mask".

arch/x86/include/asm/page_types.h:57:#define __PHYSICAL_MASK		physical_mask
arch/x86/mm/pat/memtype.c:560:		return address & __PHYSICAL_MASK;

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-05-20 18:48     ` Kuppuswamy, Sathyanarayanan
  2021-05-20 18:56       ` Kuppuswamy, Sathyanarayanan
@ 2021-05-20 19:33       ` Sean Christopherson
  2021-05-20 19:42         ` Kuppuswamy, Sathyanarayanan
  2021-05-20 20:30       ` [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Dave Hansen
  2 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-20 19:33 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel

On Thu, May 20, 2021, Kuppuswamy, Sathyanarayanan wrote:
> Hi Dave,
> 
> On 5/19/21 9:14 AM, Dave Hansen wrote:
> > On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > 
> > > tdx_shared_mask() returns the mask that has to be set in a page
> > > table entry to make page shared with VMM.
> > 
> > Here's a rewrite:
> > 
> > Just like MKTME, TDX reassigns bits of the physical address for
> > metadata.  MKTME used several bits for an encryption KeyID.  TDX uses a
> > single bit in guests to communicate whether a physical page should be
> > protected by TDX as private memory (bit set to 0) or unprotected and
> > shared with the VMM (bit set to 1).
> > 
> > Add a helper, tdg_shared_mask() (bad name please fix it) to generate the
> 
> Initially we have used tdx_* prefix for the guest code. But when the code from
> host side got merged together, we came across many name conflicts.

Whatever the conflicts are, they are by no means an unsolvable problem.  I am
more than happy to end up with slightly verbose names in KVM if that's what it
takes to avoid "tdg".

> So to avoid such issues in future, we were asked not to use the "tdx_" prefix
> and our alternative choice was "tdg_".

Who asked you not to use tdx_?  More specifically, did that feedback come from a
maintainer (or anyone on-list), or was it an Intel-internal decision?

> Also, IMO, "tdg" prefix is more meaningful for guest code (Trusted Domain Guest)
> compared to "tdx" (Trusted Domain eXtensions). I know that it gets confusing
> when grepping for TDX related changes. But since these functions are only used
> inside arch/x86 it should not be too confusing.
> 
> Even if rename is requested, IMO, it is easier to do it in one patch over
> making changes in all the patches. So if it is required, we can do it later
> once these initial patches were merged.

Hell no, we are not merging known bad crud that requires useless churn to get
things right.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-05-20 19:33       ` Sean Christopherson
@ 2021-05-20 19:42         ` Kuppuswamy, Sathyanarayanan
  2021-05-20 20:16           ` Sean Christopherson
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-20 19:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel



On 5/20/21 12:33 PM, Sean Christopherson wrote:
>> Initially we have used tdx_* prefix for the guest code. But when the code from
>> host side got merged together, we came across many name conflicts.
> Whatever the conflicts are, they are by no means an unsolvable problem.  I am
> more than happy to end up with slightly verbose names in KVM if that's what it
> takes to avoid "tdg".
> 
>> So to avoid such issues in future, we were asked not to use the "tdx_" prefix
>> and our alternative choice was "tdg_".
> Who asked you not to use tdx_?  More specifically, did that feedback come from a
> maintainer (or anyone on-list), or was it an Intel-internal decision?

It was Intel-internal feedback.

> 
>> Also, IMO, "tdg" prefix is more meaningful for guest code (Trusted Domain Guest)
>> compared to "tdx" (Trusted Domain eXtensions). I know that it gets confusing
>> when grepping for TDX related changes. But since these functions are only used
>> inside arch/x86 it should not be too confusing.
>>
>> Even if rename is requested, IMO, it is easier to do it in one patch over
>> making changes in all the patches. So if it is required, we can do it later
>> once these initial patches were merged.
> Hell no, we are not merging known bad crud that requires useless churn to get
> things right.

So what is your proposal? "tdx_guest_" / "tdx_host_" ?

If there is supposed to be a rename, let's wait till we know the maintainers'
feedback as well. If possible, I would prefer not to go through another
rename.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-11  9:35             ` Borislav Petkov
@ 2021-05-20 20:12               ` Kuppuswamy, Sathyanarayanan
  2021-05-21 15:18                 ` Borislav Petkov
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-20 20:12 UTC (permalink / raw)
  To: Borislav Petkov, Sean Christopherson
  Cc: Dave Hansen, Andi Kleen, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel



On 5/11/21 2:35 AM, Borislav Petkov wrote:
> Preach brother!:)
> 
> /me goes and greps mailboxes...
> 
> ah, do you mean this, per chance:
> 
> https://lore.kernel.org/kvm/20210421144402.GB5004@zn.tnic/
> 
> ?
> 
> And yes, this has "sev" in the name and dhansen makes sense to me in
> wishing to unify all the protected guest feature queries under a common
> name. And then depending on the vendor, that common name will call the
> respective vendor's helper to answer the protected guest aspect asked
> about.
> 
> This way, generic code will call
> 
> 	protected_guest_has()
> 
> or so and be nicely abstracted away from the underlying implementation.
> 
> Hohumm, yap, sounds nice to me.
> 
> Thx.

I see many variants of SEV/SME-related checks in the common code path
between TDX and SEV/SME. Can a generic call like
protected_guest_has(MEMORY_ENCRYPTION) or is_protected_guest()
replace all these variants?

We will not be able to test AMD-related features, so I need to confirm
this with the AMD code maintainers/developers before making the change.

arch/x86/include/asm/io.h:313:	if (sev_key_active() || is_tdx_guest()) {			\
arch/x86/include/asm/io.h:329:	if (sev_key_active() || is_tdx_guest()) {			\
arch/x86/kernel/pci-swiotlb.c:52:	if (sme_active() || is_tdx_guest())
arch/x86/mm/ioremap.c:96:	if (!sev_active() && !is_tdx_guest())
arch/x86/mm/pat/set_memory.c:1984:	if (!mem_encrypt_active() && !is_tdx_guest())

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-05-20 19:42         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-20 20:16           ` Sean Christopherson
  2021-05-20 20:31             ` Andi Kleen
  2021-05-20 20:56             ` Dave Hansen
  0 siblings, 2 replies; 381+ messages in thread
From: Sean Christopherson @ 2021-05-20 20:16 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel

On Thu, May 20, 2021, Kuppuswamy, Sathyanarayanan wrote:
> So what is your proposal? "tdx_guest_" / "tdx_host_" ?

  1. Abstract things where appropriate, e.g. I'm guessing there is a clever way
     to deal with the shared vs. private inversion and avoid tdg_shared_mask
     altogether.

  2. Steal what SEV-ES did for the #VC handlers and use ve_ as the prefix for
     handlers.

  3. Use tdx_ everywhere else and handle the conflicts on a case-by-case basis
     with a healthy dose of common sense.  E.g. there should be no need to worry
     about "static __cpuidle void tdg_safe_halt(void)" colliding because neither
     the guest nor KVM should be exposing tdx_safe_halt() outside of its
     compilation unit.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-05-20 18:48     ` Kuppuswamy, Sathyanarayanan
  2021-05-20 18:56       ` Kuppuswamy, Sathyanarayanan
  2021-05-20 19:33       ` Sean Christopherson
@ 2021-05-20 20:30       ` Dave Hansen
  2 siblings, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-20 20:30 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 5/20/21 11:48 AM, Kuppuswamy, Sathyanarayanan wrote:
>>>   +/* The highest bit of a guest physical address is the "sharing"
>>> bit */
>>> +phys_addr_t tdg_shared_mask(void)
>>> +{
>>> +    return 1ULL << (td_info.gpa_width - 1);
>>> +}
>>
>> Why not just inline this thing?  Functions don't get any smaller than
>> that.  Or does it not get used anywhere else?  Or are you concerned
>> about exporting td_info?
> 
> We don't want to export td_info. It has more information additional to
> shared mask details. Any reason for suggesting to use inline?

My favorite reason is that it eliminates the need for three declarations:
1. An extern for the header
2. A stub for the header
3. The real function in the .c file.

An inline removes two places that might get out of sync in some way and
eliminates the need to check two implementation sites when grepping.

Not in this case, but in general, inlines also result in faster, more
compact code since the compiler has more visibility into what the
function does at its call sites.

Not wanting to export td_info _is_ a reasonable argument, though.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-05-20 20:16           ` Sean Christopherson
@ 2021-05-20 20:31             ` Andi Kleen
  2021-05-20 21:18               ` Sean Christopherson
  2021-05-20 20:56             ` Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-20 20:31 UTC (permalink / raw)
  To: Sean Christopherson, Kuppuswamy, Sathyanarayanan
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, linux-kernel


On 5/20/2021 1:16 PM, Sean Christopherson wrote:
> On Thu, May 20, 2021, Kuppuswamy, Sathyanarayanan wrote:
>> So what is your proposal? "tdx_guest_" / "tdx_host_" ?
>    1. Abstract things where appropriate, e.g. I'm guessing there is a clever way
>       to deal with the shared vs. private inversion and avoid tdg_shared_mask
>       altogether.
>
>    2. Steal what SEV-ES did for the #VC handlers and use ve_ as the prefix for
>       handlers.
>
>    3. Use tdx_ everywhere else and handle the conflicts on a case-by-case basis
>       with a healthy dose of common sense.  E.g. there should be no need to worry
>       about "static __cpuidle void tdg_safe_halt(void)" colliding because neither
>       the guest nor KVM should be exposing tdx_safe_halt() outside of its
>       compilation unit.


Sorry Sean, but your suggestion is against all good code hygiene 
practices. Normally we try to pick unique prefixes for every module, and 
trying to coordinate with lots of other code that is maintained by other 
people is just a long term recipe for annoying merging problems.  Same 
with coordinating with SEV-ES for ve_.

Is it really that hard to adjust your grep patterns?

I'm not against changing tdg_, but if it's changed it should be 
something unique, and also not too long. Today tdg_ fits those criteria 
nicely.


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-05-20 20:16           ` Sean Christopherson
  2021-05-20 20:31             ` Andi Kleen
@ 2021-05-20 20:56             ` Dave Hansen
  2021-05-31 21:46               ` Kirill A. Shutemov
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-20 20:56 UTC (permalink / raw)
  To: Sean Christopherson, Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, linux-kernel

On 5/20/21 1:16 PM, Sean Christopherson wrote:
> On Thu, May 20, 2021, Kuppuswamy, Sathyanarayanan wrote:
>> So what is your proposal? "tdx_guest_" / "tdx_host_" ?
>   1. Abstract things where appropriate, e.g. I'm guessing there is a clever way
>      to deal with the shared vs. private inversion and avoid tdg_shared_mask
>      altogether.

One example here would be to keep a structure like:

struct protected_mem_config
{
	unsigned long p_set_bits;
	unsigned long p_clear_bits;
}

Where 'p_set_bits' are the bits that need to be set to establish memory
protection and 'p_clear_bits' are the bits that need to be cleared.
physical_mask would clear both of them:

	physical_mask &= ~(pmc.p_set_bits | pmc.p_clear_bits);

Then, in a place like __set_memory_enc_dec(), you would query whether
memory protection was in place or not:
	
+	if (protect) {
+		cpa.mask_set = pmc.p_set_bits;
+		cpa.mask_clr = pmc.p_clear_bits;
+		map_type = TDX_MAP_PRIVATE;
+	} else {
+		cpa.mask_set = pmc.p_clear_bits;
+		cpa.mask_clr = pmc.p_set_bits;
+		map_type = TDX_MAP_SHARED;
+	}

The is_tdx_guest() if()'s would just go away.

Basically, if there's a is_tdx_guest() check in common code, it's a
place that might need an abstraction.

This, for instance:

> +	if (!ret && is_tdx_guest()) {
> +		ret = tdg_map_gpa(__pa(addr), numpages, map_type);
> +	}

could probably just be:

	if (!ret && is_protected_guest()) {
		ret = x86_vmm_protect(__pa(addr), numpages, protected);
	}

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-05-20 20:31             ` Andi Kleen
@ 2021-05-20 21:18               ` Sean Christopherson
  2021-05-20 21:23                 ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-20 21:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy, Sathyanarayanan, Dave Hansen, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel

On Thu, May 20, 2021, Andi Kleen wrote:
> 
> On 5/20/2021 1:16 PM, Sean Christopherson wrote:
> > On Thu, May 20, 2021, Kuppuswamy, Sathyanarayanan wrote:
> > > So what is your proposal? "tdx_guest_" / "tdx_host_" ?
> >    1. Abstract things where appropriate, e.g. I'm guessing there is a clever way
> >       to deal with the shared vs. private inversion and avoid tdg_shared_mask
> >       altogether.
> > 
> >    2. Steal what SEV-ES did for the #VC handlers and use ve_ as the prefix for
> >       handlers.
> > 
> >    3. Use tdx_ everywhere else and handle the conflicts on a case-by-case basis
> >       with a healthy dose of common sense.  E.g. there should be no need to worry
> >       about "static __cpuidle void tdg_safe_halt(void)" colliding because neither
> >       the guest nor KVM should be exposing tdx_safe_halt() outside of its
> >       compilation unit.
> 
> 
> Sorry Sean, but your suggestion is against all good code hygiene practices.
> Normally we try to pick unique prefixes for every module, and trying to
> coordinate with lots of other code that is maintained by other people is
> just a long term recipe for annoying merging problems.  Same with
> coordinating with SEV-ES for ve_.

For ve_?  SEV-ES uses vc_...

I'd buy that argument if the series as a whole were consistent, but there are
individual function prototypes that aren't consistent, e.g.

+static int __tdg_map_gpa(phys_addr_t gpa, int numpages,
+                        enum tdx_map_type map_type)

a number of functions that use tdx_ instead of tdg_ (I'll give y'all a break on
is_tdx_guest()), the files are all tdx.{c,h}, the shortlogs all use x86/tdx, the
comments all use TDX, and so on and so forth.

I understand the desire to have a unique prefix, but tdg is _too_ close to
tdx.  I don't want to spend the next N years wondering if tdg is a typo or intended.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-05-20 21:18               ` Sean Christopherson
@ 2021-05-20 21:23                 ` Dave Hansen
  2021-05-20 21:28                   ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-20 21:23 UTC (permalink / raw)
  To: Sean Christopherson, Andi Kleen
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel

On 5/20/21 2:18 PM, Sean Christopherson wrote:
> I understand the desire to have a unique prefix, but tdg is is _too_ close to
> tdx.  I don't want to spend the next N years wondering if tdg is a typo or intended.


Sathya has even mis-typed "tdx" instead of "tdg" in his own
changelogs up to this point.  That massively weakens the argument that
"tdg" is a good idea.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-05-20 21:23                 ` Dave Hansen
@ 2021-05-20 21:28                   ` Kuppuswamy, Sathyanarayanan
  2021-05-20 23:25                     ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-20 21:28 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson, Andi Kleen
  Cc: Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel



On 5/20/21 2:23 PM, Dave Hansen wrote:
> Sathya has even mis-typed "tdx" instead of "tdg" this in his own
> changelogs up to this point.  That massively weakens the argument that
> "tdg" is a good idea.

It is not a typo. When we did the initial rename from "tdx_" -> "tdg_",
I somehow missed updating the change log. That's why I am a bit reluctant
to go through another rename (since we would have to scan change logs,
comments and code) in all the patches.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 18:17             ` Sean Christopherson
@ 2021-05-20 22:47               ` Kirill A. Shutemov
  0 siblings, 0 replies; 381+ messages in thread
From: Kirill A. Shutemov @ 2021-05-20 22:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Andi Kleen, Dave Hansen, Kuppuswamy Sathyanarayanan,
	Peter Zijlstra, Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Tue, May 18, 2021 at 06:17:04PM +0000, Sean Christopherson wrote:
> On Tue, May 18, 2021, Andi Kleen wrote:
> > > Why does this code exist at all?  TDX and SEV-ES absolutely must share code for
> > > handling MMIO reflection.  It will require a fair amount of refactoring to move
> > > the guts of vc_handle_mmio() to common code, but there is zero reason to maintain
> > > two separate versions of the opcode cracking.
> > 
> > While that's true on the high level, all the low level details are
> > different. We looked at unifying at some point, but it would have been a
> > callback hell. I don't think unifying would make anything cleaner.
> 
> How hard did you look?  The only part that _must_ be different between SEV and
> TDX is the hypercall itself, which is wholly contained at the very end of
> vc_do_mmio().

I've come up with the code below. decode_mmio() can be shared with SEV.

I don't have a testing setup for AMD. I can do a blind patch, but it would
be much more productive if someone on the AMD side could look into this.

Any opinions?

enum mmio_type {
	MMIO_DECODE_FAILED,
	MMIO_WRITE,
	MMIO_WRITE_IMM,
	MMIO_READ,
	MMIO_READ_ZERO_EXTEND,
	MMIO_READ_SIGN_EXTEND,
	MMIO_MOVS,
};

static enum mmio_type decode_mmio(struct insn *insn, struct pt_regs *regs,
				  int *bytes)
{
	int type = MMIO_DECODE_FAILED;

	*bytes = 0;

	switch (insn->opcode.bytes[0]) {
	case 0x88: /* MOV m8,r8 */
		*bytes = 1;
		fallthrough;
	case 0x89: /* MOV m16/m32/m64, r16/r32/r64 */
		if (!*bytes)
			*bytes = insn->opnd_bytes;
		type = MMIO_WRITE;
		break;

	case 0xc6: /* MOV m8, imm8 */
		*bytes = 1;
		fallthrough;
	case 0xc7: /* MOV m16/m32/m64, imm16/imm32/imm64 */
		if (!*bytes)
			*bytes = insn->opnd_bytes;
		type = MMIO_WRITE_IMM;
		break;

	case 0x8a: /* MOV r8, m8 */
		*bytes = 1;
		fallthrough;
	case 0x8b: /* MOV r16/r32/r64, m16/m32/m64 */
		if (!*bytes)
			*bytes = insn->opnd_bytes;
		type = MMIO_READ;
		break;

	case 0xa4: /* MOVS m8, m8 */
		*bytes = 1;
		fallthrough;
	case 0xa5: /* MOVS m16/m32/m64, m16/m32/m64 */
		if (!*bytes)
			*bytes = insn->opnd_bytes;
		type = MMIO_MOVS;
		break;

	case 0x0f: /* Two-byte instruction */
		switch (insn->opcode.bytes[1]) {
		case 0xb6: /* MOVZX r16/r32/r64, m8 */
			*bytes = 1;
			fallthrough;
		case 0xb7: /* MOVZX r32/r64, m16 */
			if (!*bytes)
				*bytes = 2;
			type = MMIO_READ_ZERO_EXTEND;
			break;

		case 0xbe: /* MOVSX r16/r32/r64, m8 */
			*bytes = 1;
			fallthrough;
		case 0xbf: /* MOVSX r32/r64, m16 */
			if (!*bytes)
				*bytes = 2;
			type = MMIO_READ_SIGN_EXTEND;
			break;
		}
		break;
	}

	return type;
}

static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
	int size;
	unsigned long *reg;
	struct insn insn;
	unsigned long val = 0;

	kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
	insn_get_length(&insn);
	insn_get_opcode(&insn);

	reg = get_reg_ptr(&insn, regs);

	switch (decode_mmio(&insn, regs, &size)) {
	case MMIO_WRITE:
		memcpy(&val, reg, size);
		tdg_mmio(size, true, ve->gpa, val);
		break;
	case MMIO_WRITE_IMM:
		val = insn.immediate.value;
		tdg_mmio(size, true, ve->gpa, val);
		break;
	case MMIO_READ:
		val = tdg_mmio(size, false, ve->gpa, val);
		/* Zero-extend for 32-bit operation */
		if (size == 4)
			*reg = 0;
		memcpy(reg, &val, size);
		break;
	case MMIO_READ_ZERO_EXTEND:
		val = tdg_mmio(size, false, ve->gpa, val);

		/* Zero extend based on operand size */
		memset(reg, 0, insn.opnd_bytes);
		memcpy(reg, &val, size);
		break;
	case MMIO_READ_SIGN_EXTEND:
		val = tdg_mmio(size, false, ve->gpa, val);

		/* Sign extend based on operand size */
		if (val & (size == 1 ? 0x80 : 0x8000))
			memset(reg, 0xff, insn.opnd_bytes);
		else
			memset(reg, 0, insn.opnd_bytes);
		memcpy(reg, &val, size);
		break;
	case MMIO_MOVS:
	case MMIO_DECODE_FAILED:
		force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
		return 0;
	}

	return insn.length;
}

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 29/32] x86/tdx: Add helper to do MapGPA TDVMCALL
  2021-05-19 15:59   ` Dave Hansen
@ 2021-05-20 23:14     ` Kuppuswamy, Sathyanarayanan
  2021-05-27  4:56       ` [RFC v2-fix-v1 1/1] x86/tdx: Add helper to do MapGPA hypercall Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-20 23:14 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 5/19/21 8:59 AM, Dave Hansen wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> MapGPA TDVMCALL requests the host VMM to map a GPA range as private or
>> shared memory mappings. Shared GPA mappings can be used for
>> communication between TD guest and host VMM, for example for
>> paravirtualized IO.
> 
> As usual, I hate the changelog.  This appears to just be regurgitating
> the spec.
> 
> Is this just for part of converting an existing mapping between private
> and shared?  If so, please say that.
> 

How about the following change?

     x86/tdx: Add helper to do MapGPA hypercall

     The MapGPA hypercall is used by TDX guests to request that the VMM
     convert the existing mapping of a given GPA range between
     private and shared.

     tdx_hcall_gpa_intent() is the wrapper used for making the MapGPA
     hypercall.


>> The new helper tdx_map_gpa() provides access to the operation.
> 
> <sigh>  You got your own name wrong. It's tdg_map_gpa() in the patch.

I can use tdx_hcall_gpa_intent().

> 
> BTW, I agree with Sean on this one: "tdg" is a horrible prefix.  You
> just proved Sean's point by mistyping it.  *EVERYONE* is going to repeat
> that mistake: tdg -> tdx.
> 
>> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
>> index dc80cf7f7d08..4789798d7737 100644
>> --- a/arch/x86/include/asm/tdx.h
>> +++ b/arch/x86/include/asm/tdx.h
>> @@ -7,6 +7,11 @@
>>   
>>   #ifndef __ASSEMBLY__
>>   
>> +enum tdx_map_type {
>> +	TDX_MAP_PRIVATE,
>> +	TDX_MAP_SHARED,
>> +};
> 
> I like the enum, but please call out that this is a software construct,
> not a part of any hardware or VMM ABI.
> 
>>   #ifdef CONFIG_INTEL_TDX_GUEST
>>   
>>   #include <asm/cpufeature.h>
>> @@ -112,6 +117,8 @@ unsigned short tdg_inw(unsigned short port);
>>   unsigned int tdg_inl(unsigned short port);
>>   
>>   extern phys_addr_t tdg_shared_mask(void);
>> +extern int tdg_map_gpa(phys_addr_t gpa, int numpages,
>> +		       enum tdx_map_type map_type);
>>   
>>   #else // !CONFIG_INTEL_TDX_GUEST
>>   
>> @@ -155,6 +162,12 @@ static inline phys_addr_t tdg_shared_mask(void)
>>   {
>>   	return 0;
>>   }
>> +
>> +static inline int tdg_map_gpa(phys_addr_t gpa, int numpages,
>> +			      enum tdx_map_type map_type)
>> +{
>> +	return -ENODEV;
>> +}
> 
> FWIW, you could probably get away with just inlining tdg_map_gpa():
> 
> static inline int tdg_map_gpa(phys_addr_t gpa, int numpages, ...
> {
> 	u64 ret;
> 
> 	if (!IS_ENABLED(CONFIG_INTEL_TDX_GUEST))
> 		return -ENODEV;
> 
> 	if (map_type == TDX_MAP_SHARED)
> 		gpa |= tdg_shared_mask();
> 
> 	ret = tdvmcall(TDVMCALL_MAP_GPA, gpa, ...
> 
> 	return ret ? -EIO : 0;
> }
> 
> Then you don't have three copies of the function signature that can get
> out of sync.

I agree that this simplifies the function definition. But, there are
other TDX hypercall definitions in tdx.c. I can't move all of them to
the header file. If possible, I would like to group all hypercalls in
the same place.

Also, IMO, it is better to hide hypercall implementation details in the
C file. For example, the user of the MapGPA hypercall does not care about
the TDVMCALL_MAP_GPA leaf id value. If we inline this function, we have
to move such details to the header file.


> 
>>   #endif /* CONFIG_INTEL_TDX_GUEST */
>>   #endif /* __ASSEMBLY__ */
>>   #endif /* _ASM_X86_TDX_H */
>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>> index 7e391cd7aa2b..074136473011 100644
>> --- a/arch/x86/kernel/tdx.c
>> +++ b/arch/x86/kernel/tdx.c
>> @@ -15,6 +15,8 @@
>>   #include "tdx-kvm.c"
>>   #endif
>>   
>> +#define TDVMCALL_MAP_GPA	0x10001
>> +
>>   static struct {
>>   	unsigned int gpa_width;
>>   	unsigned long attributes;
>> @@ -98,6 +100,17 @@ static void tdg_get_info(void)
>>   	physical_mask &= ~tdg_shared_mask();
>>   }
>>   
>> +int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
>> +{
>> +	u64 ret;
>> +
>> +	if (map_type == TDX_MAP_SHARED)
>> +		gpa |= tdg_shared_mask();
>> +
>> +	ret = tdvmcall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0);
>> +	return ret ? -EIO : 0;
>> +}
> 
> The naming Intel chose here is nasty.  This doesn't "map" anything.  It
> modifies an existing mapping from what I can tell.  We could name it
> much better than the spec, perhaps:
> 
> 	tdx_hcall_gpa_intent()

I will use this function name in next version.

> 
> BTW, all of these hypercalls need a consistent prefix.

I can include _hcall in other hypercall helper functions as well.

> 
> It also needs a comment:
> 
> 	/*
> 	 * Inform the VMM of the guest's intent for this physical page:
> 	 * shared with the VMM or private to the guest.  The VMM is
> 	 * expected to change its mapping of the page in response.
> 	 *
> 	 * Note: shared->private conversions require further guest
> 	 * action to accept the page.
> 	 */
> 
> The intent here is important.  It makes it clear that this function
> really only plays a role in the conversion process.

Thanks. I will include it in next version.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-05-20 21:28                   ` Kuppuswamy, Sathyanarayanan
@ 2021-05-20 23:25                     ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-20 23:25 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Dave Hansen, Sean Christopherson
  Cc: Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel


On 5/20/2021 2:28 PM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 5/20/21 2:23 PM, Dave Hansen wrote:
>> Sathya has even mis-typed "tdx" instead of "tdg" in his own
>> changelogs up to this point.  That massively weakens the argument that
>> "tdg" is a good idea.
>
It is not a typo. But when we did the initial rename from "tdx_" ->
"tdg_", somehow I missed the changelog change. That's why I am a bit
reluctant to go for another rename (since we would have to scan
changelogs, comments and code in all the patches).


Yes I agree. If there's another rename it should be after a full review 
by all the maintainers. If there is still consensus that a rename is 
needed then it can be done then.

And we'll just hope that Sean's brain will get used to tdg_ by then.


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 1/1] x86/boot: Avoid #VE during boot for TDX platforms
  2021-05-19 16:53       ` [RFC " Dave Hansen
@ 2021-05-21 14:35         ` Kuppuswamy Sathyanarayanan
  2021-05-21 16:11           ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-21 14:35 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson,
	Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Avoid operations which will inject #VE during boot process.
They're easy to avoid and it is less complex than handling
the exceptions.

There are a few MSRs and control register bits which the
kernel normally needs to modify during boot.  But, TDX
disallows modification of these registers to help provide
consistent security guarantees (and avoid generating #VE
when updating them). Fortunately, TDX ensures that these are
all in the correct state before the kernel loads, which means
the kernel has no need to modify them.

The conditions we need to avoid are:

  * Any writes to the EFER MSR
  * Clearing CR0.NE
  * Clearing CR4.MCE

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix:
 * Fixed commit log and comments as per Dave and Dan's suggestions.
 * Merged CR0.NE related change in pa_trampoline_compat() from patch
   titled "x86/boot: Add a trampoline for APs booting in 64-bit mode"
   to this patch. It belongs in this patch.
 * Merged TRAMPOLINE_32BIT_CODE_SIZE related change from patch titled
   "x86/boot: Add a trampoline for APs booting in 64-bit mode" to this
   patch (since it was wrongly merged to that patch during patch split).

 arch/x86/boot/compressed/head_64.S   | 16 ++++++++++++----
 arch/x86/boot/compressed/pgtable.h   |  2 +-
 arch/x86/kernel/head_64.S            | 20 ++++++++++++++++++--
 arch/x86/realmode/rm/trampoline_64.S | 23 +++++++++++++++++++----
 4 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..f848569e3fb0 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,12 +616,20 @@ SYM_CODE_START(trampoline_32bit_src)
 	movl	$MSR_EFER, %ecx
 	rdmsr
 	btsl	$_EFER_LME, %eax
+	/* Avoid writing EFER if no change was made (for TDX guest) */
+	jc	1f
 	wrmsr
-	popl	%edx
+1:	popl	%edx
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
-	movl	$X86_CR4_PAE, %eax
+	movl	%cr4, %eax
+	/*
+	 * Clear all bits except CR4.MCE, which is preserved.
+	 * Clearing CR4.MCE will #VE in TDX guests.
+	 */
+	andl	$X86_CR4_MCE, %eax
+	orl	$X86_CR4_PAE, %eax
 	testl	%edx, %edx
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
@@ -635,8 +643,8 @@ SYM_CODE_START(trampoline_32bit_src)
 	pushl	$__KERNEL_CS
 	pushl	%eax
 
-	/* Enable paging again */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
+	/* Enable paging again. Avoid clearing X86_CR0_NE for TDX */
+	movl	$(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	lret
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE	0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
 
 #define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..6cf8d126b80a 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,13 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 1:
 
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	movq	%cr4, %rcx
+	/*
+	 * Clear all bits except CR4.MCE, which is preserved.
+	 * Clearing CR4.MCE will #VE in TDX guests.
+	 */
+	andl	$X86_CR4_MCE, %ecx
+	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
@@ -229,13 +235,23 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	/* Setup EFER (Extended Feature Enable Register) */
 	movl	$MSR_EFER, %ecx
 	rdmsr
+	/*
+	 * Preserve current value of EFER for comparison and to skip
+	 * EFER writes if no change was made (for TDX guest)
+	 */
+	movl    %eax, %edx
 	btsl	$_EFER_SCE, %eax	/* Enable System Call */
 	btl	$20,%edi		/* No Execute supported? */
 	jnc     1f
 	btsl	$_EFER_NX, %eax
 	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1:	wrmsr				/* Make changes effective */
 
+	/* Avoid writing EFER if no change was made (for TDX guest) */
+1:	cmpl	%edx, %eax
+	je	1f
+	xor	%edx, %edx
+	wrmsr				/* Make changes effective */
+1:
 	/* Setup cr0 */
 	movl	$CR0_STATE, %eax
 	/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 957bb21ce105..cf14d0326a48 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,27 @@ SYM_CODE_START(startup_32)
 	movl	%eax, %cr3
 
 	# Set up EFER
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	/*
+	 * Skip writing to EFER if the register already has desiered
+	 * value (to avoid #VE for TDX guest).
+	 */
+	cmp	pa_tr_efer, %eax
+	jne	.Lwrite_efer
+	cmp	pa_tr_efer + 4, %edx
+	je	.Ldone_efer
+.Lwrite_efer:
 	movl	pa_tr_efer, %eax
 	movl	pa_tr_efer + 4, %edx
-	movl	$MSR_EFER, %ecx
 	wrmsr
 
-	# Enable paging and in turn activate Long Mode
-	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+.Ldone_efer:
+	/*
+	 * Enable paging and in turn activate Long Mode. Avoid clearing
+	 * X86_CR0_NE for TDX.
+	 */
+	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	/*
@@ -169,7 +183,8 @@ SYM_CODE_START(pa_trampoline_compat)
 	movl	$rm_stack_end, %esp
 	movw	$__KERNEL_DS, %dx
 
-	movl	$X86_CR0_PE, %eax
+	/* Avoid clearing X86_CR0_NE for TDX */
+	movl	$(X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 	ljmpl   $__KERNEL32_CS, $pa_startup_32
 SYM_CODE_END(pa_trampoline_compat)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-05-20  0:42                 ` Kuppuswamy, Sathyanarayanan
@ 2021-05-21 14:39                   ` Kuppuswamy Sathyanarayanan
  2021-05-21 18:29                     ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-21 14:39 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, Kuppuswamy Sathyanarayanan, linux-kernel

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a trampoline for booting APs in 64-bit mode via a software handoff
with BIOS, and use the new trampoline for the ACPI MP wake protocol used
by TDX. You can find MADT MP wake protocol details in ACPI specification
r6.4, sec 5.2.12.19.

Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
mode.  For the GDT pointer, create a new entry as the existing storage
for the pointer occupies the zero entry in the GDT itself.

Reported-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix:
 * Passed rmh as argument to get_trampoline_start_ip().
 * Added a comment line for get_trampoline_start_ip().
 * Moved X86_CR0_NE change from pa_trampoline_compat() to patch
   "x86/boot: Avoid #VE during boot for TDX platforms".
 * Fixed comments for tr_idt as per Dan's comments.
 * Moved TRAMPOLINE_32BIT_CODE_SIZE change to "x86/boot: Avoid #VE
   during boot for TDX platforms" patch.

 arch/x86/include/asm/realmode.h          | 11 +++++++
 arch/x86/kernel/smpboot.c                |  2 +-
 arch/x86/realmode/rm/header.S            |  1 +
 arch/x86/realmode/rm/trampoline_64.S     | 38 ++++++++++++++++++++++++
 arch/x86/realmode/rm/trampoline_common.S | 12 +++++++-
 5 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 5db5d083c873..0f707521b797 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
 	u32	sev_es_trampoline_start;
 #endif
 #ifdef CONFIG_X86_64
+	u32	trampoline_start64;
 	u32	trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
@@ -88,6 +89,16 @@ static inline void set_real_mode_mem(phys_addr_t mem)
 	real_mode_header = (struct real_mode_header *) __va(mem);
 }
 
+/* Common helper function to get start IP address */
+static inline unsigned long get_trampoline_start_ip(struct real_mode_header *rmh)
+{
+#ifdef CONFIG_X86_64
+	if (is_tdx_guest())
+		return rmh->trampoline_start64;
+#endif
+	return rmh->trampoline_start;
+}
+
 void reserve_real_mode(void);
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 16703c35a944..659e8d011fe6 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1031,7 +1031,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
 		       int *cpu0_nmi_registered)
 {
 	/* start_ip had better be page-aligned! */
-	unsigned long start_ip = real_mode_header->trampoline_start;
+	unsigned long start_ip = get_trampoline_start_ip(real_mode_header);
 
 	unsigned long boot_error = 0;
 	unsigned long timeout;
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
 	.long	pa_sev_es_trampoline_start
 #endif
 #ifdef CONFIG_X86_64
+	.long	pa_trampoline_start64
 	.long	pa_trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 84c5d1b33d10..957bb21ce105 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
 	ljmpl	$__KERNEL_CS, $pa_startup_64
 SYM_CODE_END(startup_32)
 
+SYM_CODE_START(pa_trampoline_compat)
+	/*
+	 * In compatibility mode.  Prep ESP and DX for startup_32, then disable
+	 * paging and complete the switch to legacy 32-bit mode.
+	 */
+	movl	$rm_stack_end, %esp
+	movw	$__KERNEL_DS, %dx
+
+	movl	$X86_CR0_PE, %eax
+	movl	%eax, %cr0
+	ljmpl   $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
 	.section ".text64","ax"
 	.code64
 	.balign 4
@@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
 	jmpq	*tr_start(%rip)
 SYM_CODE_END(startup_64)
 
+SYM_CODE_START(trampoline_start64)
+	/*
+	 * APs start here on a direct transfer from 64-bit BIOS with identity
+	 * mapped page tables.  Load the kernel's GDT in order to gear down to
+	 * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+	 * segment registers.  Load the zero IDT so any fault triggers a
+	 * shutdown instead of jumping back into BIOS.
+	 */
+	lidt	tr_idt(%rip)
+	lgdt	tr_gdt64(%rip)
+
+	ljmpl	*tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
 	.section ".rodata","a"
 	# Duplicate the global descriptor table
 	# so the kernel can live anywhere
@@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
 	.quad	0x00cf93000000ffff	# __KERNEL_DS
 SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
 
+SYM_DATA_START(tr_gdt64)
+	.short	tr_gdt_end - tr_gdt - 1	# gdt limit
+	.long	pa_tr_gdt
+	.long	0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+	.long	pa_trampoline_compat
+	.short	__KERNEL32_CS
+SYM_DATA_END(tr_compat)
+
 	.bss
 	.balign	PAGE_SIZE
 SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 5033e640f957..4331c32c47f8 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -1,4 +1,14 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 	.section ".rodata","a"
 	.balign	16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+
+/*
+ * When a bootloader hands off to the kernel in 32-bit mode, an
+ * IDT with a 2-byte limit and 4-byte base is needed. When a
+ * bootloader hands off to a kernel in 64-bit mode, the base
+ * address extends to 8 bytes. Reserve enough space for either scenario.
+ */
+SYM_DATA_START_LOCAL(tr_idt)
+	.short  0
+	.quad   0
+SYM_DATA_END(tr_idt)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-20 20:12               ` Kuppuswamy, Sathyanarayanan
@ 2021-05-21 15:18                 ` Borislav Petkov
  2021-05-21 16:19                   ` Tom Lendacky
  0 siblings, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-05-21 15:18 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Sean Christopherson, Dave Hansen, Andi Kleen, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel,
	Brijesh Singh, Tom Lendacky

On Thu, May 20, 2021 at 01:12:58PM -0700, Kuppuswamy, Sathyanarayanan wrote:
> I see many variants of SEV/SME related checks in the common code path
> between TDX and SEV/SME. Can a generic call like
> protected_guest_has(MEMORY_ENCRYPTION) or is_protected_guest()
> replace all these variants?

It depends...

> We will not be able to test AMD related features. So I need to confirm
> it with AMD code maintainers/developers before making this change.

Lemme add two to Cc.

So looking at those examples, you guys are making it not very
suspenseful for TDX - it is the same function in all. :)

> arch/x86/include/asm/io.h:313:	if (sev_key_active() || is_tdx_guest()) {			\
> arch/x86/include/asm/io.h:329:	if (sev_key_active() || is_tdx_guest()) {			\

So I think the static key on the AMD side is not really needed and it
could be replaced with

	sev_active() && !sev_es_active()

i.e. SEV but not SEV-ES. A vendor-agnostic function would do here
probably something like:

	protected_guest_has(ENC_UNROLL_STRING_IO)

and inside it, it would do:

	if (AMD)
		amd_protected_guest_has(...)
	else if (Intel)
		intel_protected_guest_has(...)
	else
		WARN()

and both vendors would each implement that function with the respective
low-level query functions.

> arch/x86/kernel/pci-swiotlb.c:52:	if (sme_active() || is_tdx_guest())

That can be probably

	protected_guest_has(ENC_HOST_MEM_ENCRYPT);

as on AMD that means SME but not SEV. I guess on Intel you guys want to
do bounce buffers in the guest? or so...

> arch/x86/mm/ioremap.c:96:	if (!sev_active() && !is_tdx_guest())

So that function should simply be replaced with:

        if (!(desc->flags & IORES_MAP_ENCRYPTED)) {
		/* ... comment bla explaining what this is... */
		if ((sev_active() || is_tdx_guest()) &&
		    (res->desc != IORES_DESC_NONE &&
		     res->desc != IORES_DESC_RESERVED))
				desc->flags |= IORES_MAP_ENCRYPTED;
	}

as to the first check I guess:

	protected_guest_has(ENC_GUEST_ENABLED)

or so to mean, kernel is running as an encrypted guest...

> arch/x86/mm/pat/set_memory.c:1984:	if (!mem_encrypt_active() && !is_tdx_guest())

That should probably be

	protected_guest_has(ENC_ACTIVE);

to denote the generic "I'm running some sort of memory encryption..."

Yeah, this is all rough and should show the main idea - to have a
vendor-agnostic accessor in such common code paths and then abstract
away the differences in cpu/amd.c and cpu/intel.c, respectively and thus
keep the code sane.

How does that sound?

ENC_ being an ENCryption prefix, ofc.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86/boot: Avoid #VE during boot for TDX platforms
  2021-05-21 14:35         ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
@ 2021-05-21 16:11           ` Dave Hansen
  2021-05-21 18:18             ` Sean Christopherson
  2021-05-21 18:31             ` [RFC v2-fix-v2 " Kuppuswamy, Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-21 16:11 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson

> Avoid operations which will inject #VE during boot process.
> They're easy to avoid and it is less complex than handling
> the exceptions.

This puts the solution before the problem.  I'd also make sure to
clearly connect this solution to the problem.  For instance, if you
refer to register "modification", ensure that you reflect that language
here.  Don't call them "modifications" in one part of the changelog and
"operations" here.  I'd also qualify them as "superfluous".

Please reorder this in the following form:

1. Background
2. Problem
3. Solution

Please do this for all of your patches.

> There are a few MSRs and control register bits which the
> kernel normally needs to modify during boot.  But, TDX
> disallows modification of these registers to help provide
> consistent security guarantees ( and avoid generating #VE
> when updating them).

No, the TDX architecture does not avoid generating #VE.  The *kernel*
does that.  This sentence conflates those two things.

> Fortunately, TDX ensures that these are
> all in the correct state before the kernel loads, which means
> the kernel has no need to modify them.
> 
> The conditions we need to avoid are:
> 
>   * Any writes to the EFER MSR
>   * Clearing CR0.NE
>   * Clearing CR4.MCE

Sathya, there have been repeated issues in your changelogs with "we's".
 Remember, speak in imperative voice.  Please fix this in your tooling
to find these so that reviewers don't have to.

> +	/*
> +	 * Preserve current value of EFER for comparison and to skip
> +	 * EFER writes if no change was made (for TDX guest)
> +	 */
> +	movl    %eax, %edx
>  	btsl	$_EFER_SCE, %eax	/* Enable System Call */
>  	btl	$20,%edi		/* No Execute supported? */
>  	jnc     1f
>  	btsl	$_EFER_NX, %eax
>  	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
> -1:	wrmsr				/* Make changes effective */
>  
> +	/* Avoid writing EFER if no change was made (for TDX guest) */
> +1:	cmpl	%edx, %eax
> +	je	1f
> +	xor	%edx, %edx
> +	wrmsr				/* Make changes effective */
> +1:

Just curious, but what if this goes wrong?  Say the TDX firmware didn't
set up EFER correctly and this code does the WRMSR.  What ends up
happening?  Do we get anything out on the console, or is it essentially
undebuggable?

> 
> +	/*
> +	 * Skip writing to EFER if the register already has desiered
> +	 * value (to avoid #VE for TDX guest).
> +	 */


							spelling ^

There are lots of editors that can do spell checking, even in C
comments.  You might want to look into that for your editor.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-21 15:18                 ` Borislav Petkov
@ 2021-05-21 16:19                   ` Tom Lendacky
  2021-05-21 18:49                     ` Borislav Petkov
  2021-05-26 21:37                     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Tom Lendacky @ 2021-05-21 16:19 UTC (permalink / raw)
  To: Borislav Petkov, Kuppuswamy, Sathyanarayanan
  Cc: Sean Christopherson, Dave Hansen, Andi Kleen, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel,
	Brijesh Singh

On 5/21/21 10:18 AM, Borislav Petkov wrote:
> On Thu, May 20, 2021 at 01:12:58PM -0700, Kuppuswamy, Sathyanarayanan wrote:
>> I see many variants of SEV/SME related checks in the common code path
>> between TDX and SEV/SME. Can a generic call like
>> protected_guest_has(MEMORY_ENCRYPTION) or is_protected_guest()
>> replace all these variants?
> 
> It depends...
> 
>> We will not be able to test AMD related features. So I need to confirm
>> it with AMD code maintainers/developers before making this change.
> 
> Lemme add two to Cc.
> 
> So looking at those examples, you guys are making it not very
> suspenseful for TDX - it is the same function in all. :)
> 
>> arch/x86/include/asm/io.h:313:	if (sev_key_active() || is_tdx_guest()) {			\
>> arch/x86/include/asm/io.h:329:	if (sev_key_active() || is_tdx_guest()) {			\
> 
> So I think the static key on the AMD side is not really needed and it
> could be replaced with
> 
> 	sev_active() && !sev_es_active()
> 
> i.e. SEV but not SEV-ES. A vendor-agnostic function would do here
> probably something like:
> 
> 	protected_guest_has(ENC_UNROLL_STRING_IO)
> 
> and inside it, it would do:
> 
> 	if (AMD)
> 		amd_protected_guest_has(...)
> 	else if (Intel)
> 		intel_protected_guest_has(...)
> 	else
> 		WARN()
> 
> and both vendors would each implement that function with the respective
> low-level query functions.
> 
>> arch/x86/kernel/pci-swiotlb.c:52:	if (sme_active() || is_tdx_guest())
> 
> That can be probably
> 
> 	protected_guest_has(ENC_HOST_MEM_ENCRYPT);
> 
> as on AMD that means SME but not SEV. I guess on Intel you guys want to
> do bounce buffers in the guest? or so...

In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
when SEV support was added), we do:
	if (sev_active())
		swiotlb_force = SWIOTLB_FORCE;

TDX should be able to do a similar thing without having to touch
arch/x86/kernel/pci-swiotlb.c.

That would remove any confusion over SME being part of a
protected_guest_has() call.

> 
>> arch/x86/mm/ioremap.c:96:	if (!sev_active() && !is_tdx_guest())
> 
> So that function should simply be replaced with:
> 
>         if (!(desc->flags & IORES_MAP_ENCRYPTED)) {
> 		/* ... comment bla explaining what this is... */
> 		if ((sev_active() || is_tdx_guest()) &&
> 		    (res->desc != IORES_DESC_NONE &&
> 		     res->desc != IORES_DESC_RESERVED))
> 				desc->flags |= IORES_MAP_ENCRYPTED;
> 	}

I kinda like the separate function, though.

> 
> as to the first check I guess:
> 
> 	protected_guest_has(ENC_GUEST_ENABLED)
> 
> or so to mean, kernel is running as an encrypted guest...
> 
>> arch/x86/mm/pat/set_memory.c:1984:	if (!mem_encrypt_active() && !is_tdx_guest())
> 
> That should probably be
> 
> 	protected_guest_has(ENC_ACTIVE);
> 
> to denote the generic "I'm running some sort of memory encryption..."

Except mem_encrypt_active() covers both SME and SEV, so
protected_guest_has() would be confusing.

Thanks,
Tom

> 
> Yeah, this is all rough and should show the main idea - to have a
> vendor-agnostic accessor in such common code paths and then abstract
> away the differences in cpu/amd.c and cpu/intel.c, respectively and thus
> keep the code sane.
> 
> How does that sound?
> 
> ENC_ being an ENCryption prefix, ofc.
> 

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86/boot: Avoid #VE during boot for TDX platforms
  2021-05-21 16:11           ` Dave Hansen
@ 2021-05-21 18:18             ` Sean Christopherson
  2021-05-21 18:30               ` Dave Hansen
  2021-05-21 18:31             ` [RFC v2-fix-v2 " Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-21 18:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel, Sean Christopherson

On Fri, May 21, 2021, Dave Hansen wrote:
> > +	/*
> > +	 * Preserve current value of EFER for comparison and to skip
> > +	 * EFER writes if no change was made (for TDX guest)
> > +	 */
> > +	movl    %eax, %edx
> >  	btsl	$_EFER_SCE, %eax	/* Enable System Call */
> >  	btl	$20,%edi		/* No Execute supported? */
> >  	jnc     1f
> >  	btsl	$_EFER_NX, %eax
> >  	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
> > -1:	wrmsr				/* Make changes effective */
> >  
> > +	/* Avoid writing EFER if no change was made (for TDX guest) */
> > +1:	cmpl	%edx, %eax
> > +	je	1f
> > +	xor	%edx, %edx
> > +	wrmsr				/* Make changes effective */
> > +1:
> 
> Just curious, but what if this goes wrong?  Say the TDX firmware didn't
> set up EFER correctly and this code does the WRMSR.

By firmware, do you mean TDX-module, or guest firmware?  EFER is read-only in a
TDX guest, i.e. the guest firmware can't change it either.

> What ends up happening?  Do we get anything out on the console, or is it
> essentially undebuggable?

Assuming "firmware" means TDX-module, if TDX-Module botches EFER (and only EFER)
then odds are very, very good that the guest will never get to the kernel as it
will have died long before in guest BIOS.

If the bug is such that EFER is correct in hardware, but RDMSR returns the wrong
value (due to MSR interception), IIRC this will triple fault and so nothing will
get logged.  But, the odds of that type of bug being hit in production are
practically zero because the EFER setup is very static, i.e. any such bug should
be hit during qualification of the VMM+TDX-Module.

In any case, even if a bug escapes, the shutdown is relatively easy to debug even
without logs because the failure will clearly point at the WRMSR (that info can be
had by running a debug TD or a debug TDX-Module).  By TDX standards, debugging
shutdowns on a specific instruction is downright trivial :-).

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-05-21 14:39                   ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
@ 2021-05-21 18:29                     ` Dan Williams
  0 siblings, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-21 18:29 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Fri, May 21, 2021 at 7:40 AM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Add a trampoline for booting APs in 64-bit mode via a software handoff
> with BIOS, and use the new trampoline for the ACPI MP wake protocol used
> by TDX. You can find MADT MP wake protocol details in ACPI specification
> r6.4, sec 5.2.12.19.
>
> Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
> mode.  For the GDT pointer, create a new entry as the existing storage
> for the pointer occupies the zero entry in the GDT itself.
>
> Reported-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>
> Changes since RFC v2-fix:
>  * Passed rmh as argument to get_trampoline_start_ip().
>  * Added a comment line for get_trampoline_start_ip().
>  * Moved X86_CR0_NE change from pa_trampoline_compat() to patch
>    "x86/boot: Avoid #VE during boot for TDX platforms".
>  * Fixed comments for tr_idt as per Dan's comments.
>  * Moved TRAMPOLINE_32BIT_CODE_SIZE change to "x86/boot: Avoid #VE
>    during boot for TDX platforms" patch.

Thanks, looks good, no more comments from me:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
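
[Editorial aside: the "four bytes" in the changelog above is simply the size difference between the 32-bit and 64-bit LIDT/LGDT operand formats. A hypothetical C sketch of the two layouts, not kernel code:]

```c
#include <assert.h>
#include <stdint.h>

/* Operand of LIDT/LGDT in 32-bit mode: 16-bit limit + 32-bit base */
struct desc_ptr32 {
	uint16_t limit;
	uint32_t base;
} __attribute__((packed));

/* Operand of LIDT/LGDT in 64-bit mode: 16-bit limit + 64-bit base */
struct desc_ptr64 {
	uint16_t limit;
	uint64_t base;
} __attribute__((packed));
```

sizeof(struct desc_ptr64) - sizeof(struct desc_ptr32) is exactly the four bytes by which the real-mode IDT pointer storage grows.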

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86/boot: Avoid #VE during boot for TDX platforms
  2021-05-21 18:18             ` Sean Christopherson
@ 2021-05-21 18:30               ` Dave Hansen
  2021-05-21 18:32                 ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-21 18:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel, Sean Christopherson

On 5/21/21 11:18 AM, Sean Christopherson wrote:
> On Fri, May 21, 2021, Dave Hansen wrote:
>>> +	/*
>>> +	 * Preserve current value of EFER for comparison and to skip
>>> +	 * EFER writes if no change was made (for TDX guest)
>>> +	 */
>>> +	movl    %eax, %edx
>>>  	btsl	$_EFER_SCE, %eax	/* Enable System Call */
>>>  	btl	$20,%edi		/* No Execute supported? */
>>>  	jnc     1f
>>>  	btsl	$_EFER_NX, %eax
>>>  	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
>>> -1:	wrmsr				/* Make changes effective */
>>>  
>>> +	/* Avoid writing EFER if no change was made (for TDX guest) */
>>> +1:	cmpl	%edx, %eax
>>> +	je	1f
>>> +	xor	%edx, %edx
>>> +	wrmsr				/* Make changes effective */
>>> +1:
>>
>> Just curious, but what if this goes wrong?  Say the TDX firmware didn't
>> set up EFER correctly and this code does the WRMSR.
> 
> By firmware, do you mean TDX-module, or guest firmware?  EFER is read-only in a
> TDX guest, i.e. the guest firmware can't change it either.

I guess I was assuming that the trusted BIOS was going to do the setup
of EFER before it hands control over to the kernel.  So, I *meant* the BIOS.

But, I see from below that it's probably the TDX-module that's
responsible for this behavior.

>> What ends up happening?  Do we get anything out on the console, or is it
>> essentially undebuggable?
> 
> Assuming "firmware" means TDX-module, if TDX-Module botches EFER (and only EFER)
> then odds are very, very good that the guest will never get to the kernel as it
> will have died long before in guest BIOS.
> 
> If the bug is such that EFER is correct in hardware, but RDMSR returns the wrong
> value (due to MSR interception), IIRC this will triple fault and so nothing will
> get logged.  But, the odds of that type of bug being hit in production are
> practically zero because the EFER setup is very static, i.e. any such bug should
> be hit during qualification of the VMM+TDX-Module.
> 
> In any case, even if a bug escapes, the shutdown is relatively easy to debug even
> without logs because the failure will clearly point at the WRMSR (that info can be
> had by running a debug TD or a debug TDX-Module).  By TDX standards, debugging
> shutdowns on a specific instruction is downright trivial :-).

That sounds sane to me.  It would be nice to get this into the
changelog.  Perhaps:

	This theoretically makes guest boot more fragile.  If, for
	instance, EFER was set up incorrectly and a WRMSR was performed,
	the resulting (unhandled) #VE would triple fault.  However, this
	is likely to trip up the guest BIOS long before control reaches
	the kernel.  In any case, these kinds of problems are unlikely
	to occur in production environments, and developers have good
	debug tools to fix them quickly.

That would put my mind at ease a bit.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86/boot: Avoid #VE during boot for TDX platforms
  2021-05-21 16:11           ` Dave Hansen
  2021-05-21 18:18             ` Sean Christopherson
@ 2021-05-21 18:31             ` Kuppuswamy, Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-21 18:31 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson



On 5/21/21 9:11 AM, Dave Hansen wrote:
>> Avoid operations which will inject #VE during boot process.
>> They're easy to avoid and it is less complex than handling
>> the exceptions.
> 
> This puts the solution before the problem.  I'd also make sure to
> clearly connect this solution to the problem.  For instance, if you
> refer to register "modification", ensure that you reflect that language
> here.  Don't call them "modifications" in one part of the changelog and
> "operations" here.  I'd also qualify them as "superfluous".
> 
> Please reorder this in the following form:
> 
> 1. Background
> 2. Problem
> 3. Solution
> 
> Please do this for all of your patches.
> 
>> There are a few MSRs and control register bits which the
>> kernel normally needs to modify during boot.  But, TDX
>> disallows modification of these registers to help provide
>> consistent security guarantees ( and avoid generating #VE
>> when updating them).
> 
> No, the TDX architecture does not avoid generating #VE.  The *kernel*
> does that.  This sentence conflates those two things.
> 
>> Fortunately, TDX ensures that these are
>> all in the correct state before the kernel loads, which means
>> the kernel has no need to modify them.
>>
>> The conditions we need to avoid are:
>>
>>    * Any writes to the EFER MSR
>>    * Clearing CR0.NE
>>    * Clearing CR3.MCE
> 
> Sathya, there have been repeated issues in your changelogs with "we's".
>   Remember, speak in imperative voice.  Please fix this in your tooling
> to find these so that reviewers don't have to.

How about the following commit log?

In TDX guests, Virtualization Exceptions (#VE) are delivered
due to specific guest actions like MSR writes, CPUID leaf
accesses or I/O accesses. But in early boot code, #VE
cannot be allowed because the required exception handler setup
support code is missing. If #VE is triggered without proper
handler support, it would lead to triple fault or kernel hang.
So, avoid operations which will inject #VE during boot process.
They're easy to avoid and it is less complex than handling the
exceptions.

There are a few MSRs and control register bits which the kernel
normally needs to modify during boot. But, TDX disallows
modification of these registers to help provide consistent
security guarantees. Fortunately, TDX ensures that these are all
in the correct state before the kernel loads, which means the
kernel has no need to modify them.

The conditions to avoid are:

   * Any writes to the EFER MSR
   * Clearing CR0.NE
   * Clearing CR3.MCE

If above conditions are not avoided, it would lead to triple
fault or kernel hang.

> 
>> +	/*
>> +	 * Preserve current value of EFER for comparison and to skip
>> +	 * EFER writes if no change was made (for TDX guest)
>> +	 */
>> +	movl    %eax, %edx
>>   	btsl	$_EFER_SCE, %eax	/* Enable System Call */
>>   	btl	$20,%edi		/* No Execute supported? */
>>   	jnc     1f
>>   	btsl	$_EFER_NX, %eax
>>   	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
>> -1:	wrmsr				/* Make changes effective */
>>   
>> +	/* Avoid writing EFER if no change was made (for TDX guest) */
>> +1:	cmpl	%edx, %eax
>> +	je	1f
>> +	xor	%edx, %edx
>> +	wrmsr				/* Make changes effective */
>> +1:
> 
> Just curious, but what if this goes wrong?  Say the TDX firmware didn't
> set up EFER correctly and this code does the WRMSR.  What ends up
> happening? 

It would lead to a triple fault.

> Do we get anything out on the console, or is it essentially
> undebuggable?
> 

We can still get logs with a debug TDX module, so it is still debuggable.

>>
>> +	/*
>> +	 * Skip writing to EFER if the register already has desiered
>> +	 * value (to avoid #VE for TDX guest).
>> +	 */
> 
> 
> 							spelling ^
> 
> There are lots of editors that can do spell checking, even in C
> comments.  You might want to look into that for your editor.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86/boot: Avoid #VE during boot for TDX platforms
  2021-05-21 18:30               ` Dave Hansen
@ 2021-05-21 18:32                 ` Kuppuswamy, Sathyanarayanan
  2021-05-24 23:27                   ` [RFC v2-fix-v3 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-21 18:32 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson
  Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, linux-kernel, Sean Christopherson



On 5/21/21 11:30 AM, Dave Hansen wrote:
> That sounds sane to me.  It would be nice to get this into the
> changelog.  Perhaps:
> 
> 	This theoretically makes guest boot more fragile.  If, for
> 	instance, EFER was set up incorrectly and a WRMSR was performed,
> 	the resulting (unhandled) #VE would triple fault.  However, this
> 	is likely to trip up the guest BIOS long before control reaches
> 	the kernel.  In any case, these kinds of problems are unlikely
> 	to occur in production environments, and developers have good
> 	debug tools to fix them quickly.
> 
> That would put my mind at ease a bit.

I can add it to the changelog.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-18  0:09     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
  2021-05-18 15:11       ` Dave Hansen
@ 2021-05-21 18:45       ` Kuppuswamy, Sathyanarayanan
  2021-05-21 19:15         ` Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-21 18:45 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson

Hi Dave,

On 5/17/21 5:09 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov"<kirill.shutemov@linux.intel.com>
> 
> Virtualization Exceptions (#VE) are delivered to TDX guests due to
> specific guest actions which may happen in either user space or the kernel:
> 
>   * Specific instructions (WBINVD, for example)
>   * Specific MSR accesses
>   * Specific CPUID leaf accesses
>   * Access to TD-shared memory, which includes MMIO
> 
> In the settings that Linux will run in, virtual exceptions are never
> generated on accesses to normal, TD-private memory that has been
> accepted.
> 
> The entry paths do not access TD-shared memory, MMIO regions or use
> those specific MSRs, instructions, CPUID leaves that might generate #VE.
> In addition, all interrupts including NMIs are blocked by the hardware
> starting with #VE delivery until TDGETVEINFO is called.  This eliminates
> the chance of a #VE during the syscall gap or paranoid entry paths and
> simplifies #VE handling.
> 
> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
> although we don't expect it to happen because we don't expect NMIs to
> trigger #VEs. Another case where they could happen is if the #VE
> exception panics, but in this case there are no guarantees on anything
> anyways.
> 
> If a guest kernel action which would normally cause a #VE occurs in the
> interrupt-disabled region before TDGETVEINFO, a #DF is delivered to the
> guest which will result in an oops (and should eventually be a panic, as
> we would like to set panic_on_oops to 1 for TDX guests).
> 
> Add basic infrastructure to handle any #VE which occurs in the kernel or
> userspace.  Later patches will add handling for specific #VE scenarios.
> 
> Convert unhandled #VE's (everything, until later in this series) so that
> they appear just like a #GP by calling ve_raise_fault() directly.
> ve_raise_fault() is similar to the #GP handler and is responsible for
> sending SIGSEGV to userspace, dying in the kernel case, and notifying
> debuggers and other die-chain users.
> 
> Co-developed-by: Sean Christopherson<sean.j.christopherson@intel.com>
> Signed-off-by: Sean Christopherson<sean.j.christopherson@intel.com>
> Signed-off-by: Kirill A. Shutemov<kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen<ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan<sathyanarayanan.kuppuswamy@linux.intel.com>
> ---

You have any other comments on this patch? If not, can you reply with your
Reviewed-by tag?

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
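
[Editorial aside: the #VE delivery rules in the quoted changelog — all interrupts blocked from #VE delivery until TDGETVEINFO, and a further #VE in that window escalating to #DF — can be condensed into a toy state machine. The helpers below are hypothetical; the real behavior lives in hardware and the TDX module.]

```c
#include <assert.h>
#include <stdbool.h>

static bool ve_info_valid;  /* set at #VE delivery, blocks further #VE */
static bool double_fault;   /* models the escalation to #DF */

/* A guest action that would raise #VE */
static void deliver_ve(void)
{
	if (ve_info_valid) {
		/* #VE while the previous one is unretrieved => #DF */
		double_fault = true;
		return;
	}
	ve_info_valid = true;
}

/* Retrieving the #VE info re-enables #VE delivery */
static void tdgetveinfo(void)
{
	ve_info_valid = false;
}
```

This is why the handler calls TDGETVEINFO first: until it does, a second #VE-producing access is fatal.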

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-21 16:19                   ` Tom Lendacky
@ 2021-05-21 18:49                     ` Borislav Petkov
  2021-05-21 21:14                       ` Tom Lendacky
  2021-05-26 21:37                     ` Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-05-21 18:49 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Kuppuswamy, Sathyanarayanan, Sean Christopherson, Dave Hansen,
	Andi Kleen, Peter Zijlstra, Andy Lutomirski, Dan Williams,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, linux-kernel, Brijesh Singh

On Fri, May 21, 2021 at 11:19:15AM -0500, Tom Lendacky wrote:
> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
> when SEV support was added), we do:
> 	if (sev_active())
> 		swiotlb_force = SWIOTLB_FORCE;
> 
> TDX should be able to do a similar thing without having to touch
> arch/x86/kernel/pci-swiotlb.c.
> 
> That would remove any confusion over SME being part of a
> protected_guest_has() call.

Even better.

> I kinda like the separate function, though.

Only if you clean it up and get rid of the inverted logic and drop that
silly switch-case.

> Except mem_encrypt_active() covers both SME and SEV, so
> protected_guest_has() would be confusing.

I don't understand - the AMD-specific function amd_protected_guest_has()
would return sme_me_mask just like mem_encrypt_active() does and we can
get rid of the latter.

Or do you have a problem with the name protected_guest_has() containing
"guest" while we're talking about SME here?

If so, feel free to suggest a better one - the name does not have to
have "guest" in it.

Thx.


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-21 18:45       ` [RFC v2-fix " Kuppuswamy, Sathyanarayanan
@ 2021-05-21 19:15         ` Dave Hansen
  2021-05-21 19:57           ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-21 19:15 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson

On 5/21/21 11:45 AM, Kuppuswamy, Sathyanarayanan wrote:
> You have any other comments on this patch? If not, can you reply with your
> Reviewed-by tag?

Sathya, I've been rather busy with your own patches and your colleagues'
TDX patches.  I've clearly communicated to you which patches I plan to
provide a review for.  I'll get to them, although not quite at the speed
you would like.

If you would like to get a quicker review, I'd highly suggest you go
find some of your TDX colleagues' code that needs its quality improved
and help by providing them reviews.  Reviews are a two-way street, not
just a service provided by maintainers to contributors.

You could also make good use of your time by going back over all of the
review comments I've made up to this point and doing a pass over your
work to ensure that I don't have to continue to repeat myself and waste
review efforts.  You could add a spell checker to your workflow, or
scripting to check for language conventions like avoiding "us" and "we".
 You could also seek out help to raise the quality of your
communications.  It isn't just reviewers that can help raise the quality
of your contributions.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-18 15:45         ` Andi Kleen
  2021-05-18 15:56           ` Dave Hansen
@ 2021-05-21 19:22           ` Dan Williams
  2021-05-24 14:02             ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-21 19:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List, Sean Christopherson

On Tue, May 18, 2021 at 8:45 AM Andi Kleen <ak@linux.intel.com> wrote:
>
>
> On 5/18/2021 8:11 AM, Dave Hansen wrote:
> > On 5/17/21 5:09 PM, Kuppuswamy Sathyanarayanan wrote:
> >> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
> >> although we don't expect it to happen because we don't expect NMIs to
> >> trigger #VEs. Another case where they could happen is if the #VE
> >> exception panics, but in this case there are no guarantees on anything
> >> anyways.
> > This implies: "we do not expect any NMI to do MMIO".  Is that true?  Why?
>
> Only drivers that are not supported in TDX anyways could do it (mainly
> watchdog drivers)

What about apei_{read,write}() for ACPI error handling? Those are
called in NMI to do MMIO accesses. It's not just watchdog drivers.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-21 19:15         ` Dave Hansen
@ 2021-05-21 19:57           ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-21 19:57 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson



On 5/21/21 12:15 PM, Dave Hansen wrote:
> On 5/21/21 11:45 AM, Kuppuswamy, Sathyanarayanan wrote:
>> You have any other comments on this patch? If not, can you reply with your
>> Reviewed-by tag?
> 
> Sathya, I've been rather busy with your own patches and your colleagues
> TDX patches.  I've clearly communicated to you which patches I plan to
> provide a review for.  I'll get to them, although not quite at the speed
> you would like.
> 

My impression so far is that, for TDX patch submissions, you usually
reply to submissions/comments within 1-2 days (sorry if this assumption
is incorrect). Since I did not see any major objections to this patch, I
was just checking whether the review is pending because something is
missing from my end. My intention was not to rush you, but only to
understand whether it needs more work from me.

Sorry if the reminder emails trouble you. Since we are aiming for v5.14
merge window, I am trying to avoid any delays from my end.

> If you would like to get a quicker review, I'd highly suggest you go
> find some of your TDX colleagues' code that needs its quality improved
> and help by providing them reviews.  Reviews are a two-way street, not
> just a service provided by maintainers to contributors.
> 
> You could also make good use of your time by going back over all of the
> review comments I've made up to this point and doing a pass over your
> work to ensure that I don't have to continue to repeat myself and waste
> review efforts.

I have taken your comments into account and fixed the common issues you
reported in this patch set. But while addressing recent comments and
updating the commit log, some of these issues crept back in. I will
try to avoid them in the future.



-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-21 18:49                     ` Borislav Petkov
@ 2021-05-21 21:14                       ` Tom Lendacky
  2021-05-25 18:21                         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Tom Lendacky @ 2021-05-21 21:14 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kuppuswamy, Sathyanarayanan, Sean Christopherson, Dave Hansen,
	Andi Kleen, Peter Zijlstra, Andy Lutomirski, Dan Williams,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, linux-kernel, Brijesh Singh



On 5/21/21 1:49 PM, Borislav Petkov wrote:
> On Fri, May 21, 2021 at 11:19:15AM -0500, Tom Lendacky wrote:
>> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
>> when SEV support was added), we do:
>> 	if (sev_active())
>> 		swiotlb_force = SWIOTLB_FORCE;
>>
>> TDX should be able to do a similar thing without having to touch
>> arch/x86/kernel/pci-swiotlb.c.
>>
>> That would remove any confusion over SME being part of a
>> protected_guest_has() call.
> 
> Even better.
> 
>> I kinda like the separate function, though.
> 
> Only if you clean it up and get rid of the inverted logic and drop that
> silly switch-case.
> 
>> Except mem_encrypt_active() covers both SME and SEV, so
>> protected_guest_has() would be confusing.
> 
> I don't understand - the AMD-specific function amd_protected_guest_has()
> would return sme_me_mask just like mem_encrypt_active() does and we can
> get rid of the latter.
> 
> Or do you have a problem with the name protected_guest_has() containing
> "guest" while we're talking about SME here?

The latter.

> 
> If so, feel free to suggest a better one - the name does not have to
> have "guest" in it.

Let me see if I can come up with something that will make sense.

Thanks,
Tom

> 
> Thx.
> 
> 

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-21 19:22           ` Dan Williams
@ 2021-05-24 14:02             ` Andi Kleen
  2021-05-27  0:29               ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-24 14:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List, Sean Christopherson


>> Only drivers that are not supported in TDX anyways could do it (mainly
>> watchdog drivers)
> What about apei_{read,write}() for ACPI error handling? Those are
> called in NMI to do MMIO accesses. It's not just watchdog drivers.

We expect the APEI stuff to be filtered in the normal case to reduce the 
attack surface. There's no use case for APEI error reporting in a 
normally operating TDX guest.

But yes, that's why I wrote "mainly". It should work in any case; we
fully support #VE nesting after TDVEREPORT.

-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v3 1/1] x86/boot: Avoid #VE during boot for TDX platforms
  2021-05-21 18:32                 ` Kuppuswamy, Sathyanarayanan
@ 2021-05-24 23:27                   ` Kuppuswamy Sathyanarayanan
  2021-05-27 21:25                     ` [RFC v2-fix-v4 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-24 23:27 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, Kuppuswamy Sathyanarayanan, linux-kernel

From: Sean Christopherson <sean.j.christopherson@intel.com>

In TDX guests, Virtualization Exceptions (#VE) are delivered
due to specific guest actions like MSR writes, CPUID leaf
accesses or I/O accesses. But in early boot code, #VE
cannot be allowed because the required exception handler setup
support code is missing. If #VE is triggered without proper
handler support, it would lead to triple fault or kernel hang.
So, avoid operations which will inject #VE during boot process.
They're easy to avoid and it is less complex than handling the
exceptions.

There are a few MSRs and control register bits which the kernel
normally needs to modify during boot. But, TDX disallows
modification of these registers to help provide consistent
security guarantees. Fortunately, TDX ensures that these are all
in the correct state before the kernel loads, which means the
kernel has no need to modify them.

The conditions to avoid are:

  * Any writes to the EFER MSR
  * Clearing CR0.NE
  * Clearing CR3.MCE

This theoretically makes guest boot more fragile. If, for
instance, EFER was set up incorrectly and a WRMSR was performed,
the resulting (unhandled) #VE would triple fault. However, this
is likely to trip up the guest BIOS long before control reaches
the kernel. In any case, these kinds of problems are unlikely to
occur in production environments, and developers have good debug
tools to fix them quickly. 

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
Changes since RFC v2-fix-v2:
 * Fixed commit log as per review comments.

Changes since RFC v2-fix:
 * Fixed commit and comments as per Dave and Dan's suggestions.
 * Merged CR0.NE related change in pa_trampoline_compat() from patch
   titled "x86/boot: Add a trampoline for APs booting in 64-bit mode"
   to this patch. It belongs in this patch.
 * Merged TRAMPOLINE_32BIT_CODE_SIZE related change from patch titled
   "x86/boot: Add a trampoline for APs booting in 64-bit mode" to this
   patch (since it was wrongly merged to that patch during patch split).

 arch/x86/boot/compressed/head_64.S   | 16 ++++++++++++----
 arch/x86/boot/compressed/pgtable.h   |  2 +-
 arch/x86/kernel/head_64.S            | 20 ++++++++++++++++++--
 arch/x86/realmode/rm/trampoline_64.S | 23 +++++++++++++++++++----
 4 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..f848569e3fb0 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,12 +616,20 @@ SYM_CODE_START(trampoline_32bit_src)
 	movl	$MSR_EFER, %ecx
 	rdmsr
 	btsl	$_EFER_LME, %eax
+	/* Avoid writing EFER if no change was made (for TDX guest) */
+	jc	1f
 	wrmsr
-	popl	%edx
+1:	popl	%edx
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
-	movl	$X86_CR4_PAE, %eax
+	movl	%cr4, %eax
+	/*
+	 * Clear all bits except CR4.MCE, which is preserved.
+	 * Clearing CR4.MCE will #VE in TDX guests.
+	 */
+	andl	$X86_CR4_MCE, %eax
+	orl	$X86_CR4_PAE, %eax
 	testl	%edx, %edx
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
@@ -635,8 +643,8 @@ SYM_CODE_START(trampoline_32bit_src)
 	pushl	$__KERNEL_CS
 	pushl	%eax
 
-	/* Enable paging again */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
+	/* Enable paging again. Avoid clearing X86_CR0_NE for TDX */
+	movl	$(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	lret
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE	0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
 
 #define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..6cf8d126b80a 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,13 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 1:
 
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	movq	%cr4, %rcx
+	/*
+	 * Clear all bits except CR4.MCE, which is preserved.
+	 * Clearing CR4.MCE will #VE in TDX guests.
+	 */
+	andl	$X86_CR4_MCE, %ecx
+	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
@@ -229,13 +235,23 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	/* Setup EFER (Extended Feature Enable Register) */
 	movl	$MSR_EFER, %ecx
 	rdmsr
+	/*
+	 * Preserve current value of EFER for comparison and to skip
+	 * EFER writes if no change was made (for TDX guest)
+	 */
+	movl    %eax, %edx
 	btsl	$_EFER_SCE, %eax	/* Enable System Call */
 	btl	$20,%edi		/* No Execute supported? */
 	jnc     1f
 	btsl	$_EFER_NX, %eax
 	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1:	wrmsr				/* Make changes effective */
 
+	/* Avoid writing EFER if no change was made (for TDX guest) */
+1:	cmpl	%edx, %eax
+	je	1f
+	xor	%edx, %edx
+	wrmsr				/* Make changes effective */
+1:
 	/* Setup cr0 */
 	movl	$CR0_STATE, %eax
 	/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 957bb21ce105..cf14d0326a48 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,27 @@ SYM_CODE_START(startup_32)
 	movl	%eax, %cr3
 
 	# Set up EFER
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	/*
+	 * Skip writing to EFER if the register already has the
+	 * desired value (to avoid #VE for TDX guest).
+	 */
+	cmp	pa_tr_efer, %eax
+	jne	.Lwrite_efer
+	cmp	pa_tr_efer + 4, %edx
+	je	.Ldone_efer
+.Lwrite_efer:
 	movl	pa_tr_efer, %eax
 	movl	pa_tr_efer + 4, %edx
-	movl	$MSR_EFER, %ecx
 	wrmsr
 
-	# Enable paging and in turn activate Long Mode
-	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+.Ldone_efer:
+	/*
+	 * Enable paging and in turn activate Long Mode. Avoid clearing
+	 * X86_CR0_NE for TDX.
+	 */
+	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	/*
@@ -169,7 +183,8 @@ SYM_CODE_START(pa_trampoline_compat)
 	movl	$rm_stack_end, %esp
 	movw	$__KERNEL_DS, %dx
 
-	movl	$X86_CR0_PE, %eax
+	/* Avoid clearing X86_CR0_NE for TDX */
+	movl	$(X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 	ljmpl   $__KERNEL32_CS, $pa_startup_32
 SYM_CODE_END(pa_trampoline_compat)
-- 
2.25.1
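
[Editorial aside: the CR4 handling added in the hunks above — preserve only CR4.MCE from the current value, then OR in the bits boot actually needs — can be expressed as a small C helper. Bit positions follow the SDM; this is an illustrative model, not kernel code.]

```c
#include <assert.h>

#define X86_CR4_PAE (1UL << 5)
#define X86_CR4_MCE (1UL << 6)
#define X86_CR4_PGE (1UL << 7)

/* Mirrors the new assembly: keep only CR4.MCE from the old value,
 * then OR in the bits the boot path needs. */
static unsigned long boot_cr4(unsigned long old_cr4)
{
	return (old_cr4 & X86_CR4_MCE) | X86_CR4_PAE | X86_CR4_PGE;
}
```

boot_cr4() never clears MCE if it was set — so a TDX guest, where clearing CR4.MCE raises #VE, stays safe — while every other bit is still forced to a known state.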


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 1/1] x86/tdx: ioapic: Add shared bit for IOAPIC base address
  2021-05-07 23:06   ` Dave Hansen
@ 2021-05-24 23:29     ` Kuppuswamy Sathyanarayanan
  2021-06-01  1:28       ` [RFC v2-fix-v3 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-24 23:29 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, Kuppuswamy Sathyanarayanan, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

The kernel interacts with each bare-metal IOAPIC with a special
MMIO page. When running under KVM, the guest's IOAPICs are
emulated by KVM.

When running as a TDX guest, the guest needs to mark each IOAPIC
mapping as "shared" with the host.  This ensures that TDX private
protections are not applied to the page, which allows the TDX host
emulation to work.

Earlier patches in this series modified ioremap() so that
ioremap()-created mappings such as virtio will be marked as
shared. However, the IOAPIC code does not use ioremap() and instead
uses the fixmap mechanism.

Introduce a special fixmap helper just for the IOAPIC code.  Ensure
that it marks IOAPIC pages as "shared".  This replaces
set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
allows custom 'prot' values.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
 * Fixed commit log and comment as per review comments.

 arch/x86/kernel/apic/io_apic.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 73ff4dd426a8..810fc58e3c42 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2675,6 +2675,18 @@ static struct resource * __init ioapic_setup_resources(void)
 	return res;
 }
 
+static void io_apic_set_fixmap_nocache(enum fixed_addresses idx,
+				       phys_addr_t phys)
+{
+	pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+
+	/* Set TDX guest shared bit in pgprot flags */
+	if (is_tdx_guest())
+		flags = pgprot_tdg_shared(flags);
+
+	__set_fixmap(idx, phys, flags);
+}
+
 void __init io_apic_init_mappings(void)
 {
 	unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
@@ -2707,7 +2719,7 @@ void __init io_apic_init_mappings(void)
 				      __func__, PAGE_SIZE, PAGE_SIZE);
 			ioapic_phys = __pa(ioapic_phys);
 		}
-		set_fixmap_nocache(idx, ioapic_phys);
+		io_apic_set_fixmap_nocache(idx, ioapic_phys);
 		apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
 			__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
 			ioapic_phys);
@@ -2836,7 +2848,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
 	ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
 	ioapics[idx].mp_config.apicaddr = address;
 
-	set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+	io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
 	if (bad_ioapic_register(idx)) {
 		clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
 		return -ENODEV;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 1/2] x86/tdx: Handle MWAIT and MONITOR
  2021-05-11 17:48                     ` Andi Kleen
@ 2021-05-24 23:32                       ` Kuppuswamy Sathyanarayanan
  2021-05-24 23:32                         ` [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest Kuppuswamy Sathyanarayanan
  2021-05-25  2:26                         ` [RFC v2-fix-v2 1/2] x86/tdx: Handle MWAIT and MONITOR Dan Williams
  0 siblings, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-24 23:32 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, Kuppuswamy Sathyanarayanan, linux-kernel

When running as a TDX guest, there are a number of existing,
privileged instructions that do not work. If the guest kernel
uses these instructions, the hardware generates a #VE.

The list of unsupported instructions can be found in the Intel
Trust Domain Extensions (Intel® TDX) Module specification,
sec 9.2.2, and in the Guest-Host Communication Interface (GHCI)
Specification for Intel TDX, sec 2.4.1.

To prevent TD guests from using the MWAIT/MONITOR instructions,
the CPUID flags for these instructions are already disabled
by the TDX module.

If a TD guest still executes one of these instructions despite
that preventive measure, print a warning (WARN_ONCE()) in the
#VE handler. This matches KVM's behavior, which also treats
MWAIT/MONITOR as nops, warning once, on unsupported platforms.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---

Changes since RFC v2:
 * Moved WBINVD related changes to a new patch.
 * Fixed commit log as per review comments.

 arch/x86/kernel/tdx.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 3e961fdfdae0..3800c7cbace3 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -511,6 +511,14 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdg_handle_mmio(regs, ve);
 		break;
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+		/*
+		 * Something in the kernel used MONITOR or MWAIT despite
+		 * X86_FEATURE_MWAIT being cleared for TDX guests.
+		 */
+		WARN_ONCE(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-24 23:32                       ` [RFC v2-fix-v2 1/2] x86/tdx: Handle MWAIT and MONITOR Kuppuswamy Sathyanarayanan
@ 2021-05-24 23:32                         ` Kuppuswamy Sathyanarayanan
  2021-05-24 23:39                           ` Dan Williams
  2021-05-24 23:42                           ` Dave Hansen
  2021-05-25  2:26                         ` [RFC v2-fix-v2 1/2] x86/tdx: Handle MWAIT and MONITOR Dan Williams
  1 sibling, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-24 23:32 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, Kuppuswamy Sathyanarayanan, linux-kernel

Functionally only DMA devices can notice a side effect from
WBINVD's cache flushing. But, TDX does not support DMA,
because DMA typically needs uncached access for MMIO, and
the current TDX module always sets the IgnorePAT bit, which
prevents that.

So handle the WBINVD instruction as a nop. No warning is
printed for WBINVD handling because the ACPI reboot code
uses it. This is the same behavior as KVM: it only allows
WBINVD in a guest when the guest supports VT-d (=DMA),
and just handles it as a nop if it doesn't.

If TDX ever gets DMA support, a hypercall will be added to
implement it, similar to AMD-SEV. But current TDX does not
support direct DMA.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
 * Fixed commit log as per review comments.
 * Removed WARN_ONCE for WBINVD #VE support.

 arch/x86/kernel/tdx.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 3800c7cbace3..21dec5bfc88e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -511,6 +511,12 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdg_handle_mmio(regs, ve);
 		break;
+	case EXIT_REASON_WBINVD:
+		/*
+		 * Non-coherent DMA is not supported in a TDX guest,
+		 * so ignore WBINVD and treat it as a nop.
+		 */
+		break;
 	case EXIT_REASON_MONITOR_INSTRUCTION:
 	case EXIT_REASON_MWAIT_INSTRUCTION:
 		/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-24 23:32                         ` [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest Kuppuswamy Sathyanarayanan
@ 2021-05-24 23:39                           ` Dan Williams
  2021-05-25  0:29                             ` Kuppuswamy, Sathyanarayanan
  2021-05-25  0:36                             ` Andi Kleen
  2021-05-24 23:42                           ` Dave Hansen
  1 sibling, 2 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-24 23:39 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Mon, May 24, 2021 at 4:32 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> Functionally only DMA devices can notice a side effect from
> WBINVD's cache flushing. But, TDX does not support DMA,
> because DMA typically needs uncached access for MMIO, and
> the current TDX module always sets the IgnorePAT bit, which
> prevents that.

I thought we discussed that there are other considerations for wbinvd
besides DMA? In any event this paragraph is actively misleading
because it disregards ACPI and Persistent Memory secure-erase whose
usages of wbinvd have nothing to do with DMA. I would much prefer a
patch to shutdown all the known wbinvd users as a precursor to this
patch rather than assuming it's ok to simply ignore it. You have
mentioned that TDX does not need to use those paths, but rather than
assume they can't be used why not do the audit to explicitly disable
them? Otherwise this statement seems to imply that the audit has not
been done.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-24 23:32                         ` [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest Kuppuswamy Sathyanarayanan
  2021-05-24 23:39                           ` Dan Williams
@ 2021-05-24 23:42                           ` Dave Hansen
  2021-05-25  0:39                             ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-24 23:42 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

On 5/24/21 4:32 PM, Kuppuswamy Sathyanarayanan wrote:
> Functionally only DMA devices can notice a side effect from
> WBINVD's cache flushing.

This seems to be trying to make some kind of case that the only visible
effects from WBINVD are for DMA devices.  That's flat out wrong.  It
might be arguable that none of the other cases exist in a TDX guest, but
it doesn't excuse making such a broad statement without qualification.

Just grep in the kernel for a bunch of reasons this is wrong.

Where did this come from?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-24 23:39                           ` Dan Williams
@ 2021-05-25  0:29                             ` Kuppuswamy, Sathyanarayanan
  2021-05-25  0:50                               ` Dan Williams
  2021-05-25  0:36                             ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-25  0:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List



On 5/24/21 4:39 PM, Dan Williams wrote:
>> Functionally only DMA devices can notice a side effect from
>> WBINVD's cache flushing. But, TDX does not support DMA,
>> because DMA typically needs uncached access for MMIO, and
>> the current TDX module always sets the IgnorePAT bit, which
>> prevents that.

> I thought we discussed that there are other considerations for wbinvd
> besides DMA? In any event this paragraph is actively misleading
> because it disregards ACPI and Persistent Memory secure-erase whose
> usages of wbinvd have nothing to do with DMA. I would much prefer a
> patch to shutdown all the known wbinvd users as a precursor to this
> patch rather than assuming it's ok to simply ignore it. You have
> mentioned that TDX does not need to use those paths, but rather than
> assume they can't be used why not do the audit to explicitly disable
> them? Otherwise this statement seems to imply that the audit has not
> been done.

But KVM also emulates WBINVD only if DMA is supported. Otherwise it
will be treated as noop.

static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
{
         return kvm_arch_has_noncoherent_dma(vcpu->kvm);
}



-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-24 23:39                           ` Dan Williams
  2021-05-25  0:29                             ` Kuppuswamy, Sathyanarayanan
@ 2021-05-25  0:36                             ` Andi Kleen
  1 sibling, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-25  0:36 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List



> I thought we discussed that there are other considerations for wbinvd
> besides DMA? In any event this paragraph is actively misleading
> because it disregards ACPI and Persistent Memory secure-erase whose
> usages of wbinvd have nothing to do with DMA.


In this case they would be broken in KVM too.


> I would much prefer a
> patch to shutdown all the known wbinvd users as a precursor to this
> patch rather than assuming it's ok to simply ignore it. You have
> mentioned that TDX does not need to use those paths, but rather than
> assume they can't be used why not do the audit to explicitly disable
> them? Otherwise this statement seems to imply that the audit has not
> been done.

We're not assuming it. We know it because KVM does it since forever.

All we want to do is do the same as KVM.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-24 23:42                           ` Dave Hansen
@ 2021-05-25  0:39                             ` Andi Kleen
  2021-05-25  0:53                               ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-25  0:39 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel


On 5/24/2021 4:42 PM, Dave Hansen wrote:
> On 5/24/21 4:32 PM, Kuppuswamy Sathyanarayanan wrote:
>> Functionally only DMA devices can notice a side effect from
>> WBINVD's cache flushing.
> This seems to be trying to make some kind of case that the only visible
> effects from WBINVD are for DMA devices.  That's flat out wrong.  It
> might be arguable that none of the other cases exist in a TDX guest, but
> it doesn't excuse making such a broad statement without qualification.

We're describing a few sentences down that guests run with EPT 
IgnorePAT=1, which is the qualification.

>
> Just grep in the kernel for a bunch of reasons this is wrong.
>
> Where did this come from?

Again the logic is very simple: TDX guest code is (mostly) about 
replacing KVM code with in kernel code, so we're just doing the same as 
KVM. You cannot get any more proven than that.


-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-25  0:29                             ` Kuppuswamy, Sathyanarayanan
@ 2021-05-25  0:50                               ` Dan Williams
  2021-05-25  0:54                                 ` Sean Christopherson
  2021-05-25  1:02                                 ` Andi Kleen
  0 siblings, 2 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-25  0:50 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Mon, May 24, 2021 at 5:30 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
>
>
> On 5/24/21 4:39 PM, Dan Williams wrote:
> >> Functionally only DMA devices can notice a side effect from
> >> WBINVD's cache flushing. But, TDX does not support DMA,
> >> because DMA typically needs uncached access for MMIO, and
> >> the current TDX module always sets the IgnorePAT bit, which
> >> prevents that.
>
> > I thought we discussed that there are other considerations for wbinvd
> > besides DMA? In any event this paragraph is actively misleading
> > because it disregards ACPI and Persistent Memory secure-erase whose
> > usages of wbinvd have nothing to do with DMA. I would much prefer a
> > patch to shutdown all the known wbinvd users as a precursor to this
> > patch rather than assuming it's ok to simply ignore it. You have
> > mentioned that TDX does not need to use those paths, but rather than
> > assume they can't be used why not do the audit to explicitly disable
> > them? Otherwise this statement seems to imply that the audit has not
> > been done.
>
> But KVM also emulates WBINVD only if DMA is supported. Otherwise it
> will be treated as noop.
>
> static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
> {
>          return kvm_arch_has_noncoherent_dma(vcpu->kvm);
> }

That makes KVM also broken for the cases where wbinvd is needed, but
it does not make the description of this patch correct.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-25  0:39                             ` Andi Kleen
@ 2021-05-25  0:53                               ` Dan Williams
  0 siblings, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-25  0:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Mon, May 24, 2021 at 5:40 PM Andi Kleen <ak@linux.intel.com> wrote:
>
>
> On 5/24/2021 4:42 PM, Dave Hansen wrote:
> > On 5/24/21 4:32 PM, Kuppuswamy Sathyanarayanan wrote:
> >> Functionally only DMA devices can notice a side effect from
> >> WBINVD's cache flushing.
> > This seems to be trying to make some kind of case that the only visible
> > effects from WBINVD are for DMA devices.  That's flat out wrong.  It
> > might be arguable that none of the other cases exist in a TDX guest, but
> > it doesn't excuse making such a broad statement without qualification.
>
> We're describing a few sentences down that guests run with EPT
> IgnorePAT=1, which is the qualification.
>
> >
> > Just grep in the kernel for a bunch of reasons this is wrong.
> >
> > Where did this come from?
>
> Again the logic is very simple: TDX guest code is (mostly) about
> replacing KVM code with in kernel code, so we're just doing the same as
> KVM. You cannot get any more proven than that.
>

I have no problem pointing at KVM as to why the risk is mitigated, but
I do have a problem with misrepresenting the scope of the risk.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-25  0:50                               ` Dan Williams
@ 2021-05-25  0:54                                 ` Sean Christopherson
  2021-05-25  1:02                                 ` Andi Kleen
  1 sibling, 0 replies; 381+ messages in thread
From: Sean Christopherson @ 2021-05-25  0:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Linux Kernel Mailing List

On Mon, May 24, 2021, Dan Williams wrote:
> On Mon, May 24, 2021 at 5:30 PM Kuppuswamy, Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> >
> >
> >
> > On 5/24/21 4:39 PM, Dan Williams wrote:
> > >> Functionally only DMA devices can notice a side effect from
> > >> WBINVD's cache flushing. But, TDX does not support DMA,
> > >> because DMA typically needs uncached access for MMIO, and
> > >> the current TDX module always sets the IgnorePAT bit, which
> > >> prevents that.
> >
> > > I thought we discussed that there are other considerations for wbinvd
> > > besides DMA? In any event this paragraph is actively misleading
> > > because it disregards ACPI and Persistent Memory secure-erase whose
> > > usages of wbinvd have nothing to do with DMA. I would much prefer a
> > > patch to shutdown all the known wbinvd users as a precursor to this
> > > patch rather than assuming it's ok to simply ignore it. You have
> > > mentioned that TDX does not need to use those paths, but rather than
> > > assume they can't be used why not do the audit to explicitly disable
> > > them? Otherwise this statement seems to imply that the audit has not
> > > been done.
> >
> > But KVM also emulates WBINVD only if DMA is supported. Otherwise it
> > will be treated as noop.
> >
> > static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
> > {
> >          return kvm_arch_has_noncoherent_dma(vcpu->kvm);
> > }
> 
> That makes KVM also broken for the cases where wbinvd is needed, but
> it does not make the description of this patch correct.

Yep!  KVM has a long and dubious history of making things work for specific use
cases without strictly adhering to the architecture.

KVM also has to worry about malicious/buggy guests, e.g. letting the guest do
WBINVD at will would be a massive noisy neighbor problem (at best), while
ratelimiting might unnecessarily harm legitimate use cases.  I.e. KVM has a
somewhat sane reason for "emulating" WBINVD as a nop.

And FWIW, IIRC all modern hardware has a coherent IOMMU, though that could be me
making things up.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-25  0:50                               ` Dan Williams
  2021-05-25  0:54                                 ` Sean Christopherson
@ 2021-05-25  1:02                                 ` Andi Kleen
  2021-05-25  1:45                                   ` Dan Williams
  1 sibling, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-25  1:02 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List


> That makes KVM also broken for the cases where wbinvd is needed,


Or maybe your analysis is wrong?


> but
> it does not make the description of this patch correct.

If KVM was broken I'm sure we would hear about it.

The ACPI cases are for S3, which is not supported in guests, or for the 
old style manual IO port C6, which isn't supported either.

The persistent memory cases would require working DMA mappings, which we 
currently don't support. If DMA mappings were added we would need to 
para virtualized WBINVD, like the comments say.

AFAIK all the rest is for some caching attribute change, which is not 
possible in KVM (because it uses EPT.IgnorePAT=1) nor in TDX (which does 
the same). Some are for MTRR which is completely disabled if you're 
running under EPT.

-Andi

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-25  1:02                                 ` Andi Kleen
@ 2021-05-25  1:45                                   ` Dan Williams
  2021-05-25  2:13                                     ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-25  1:45 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Mon, May 24, 2021 at 6:02 PM Andi Kleen <ak@linux.intel.com> wrote:
>
>
> > That makes KVM also broken for the cases where wbinvd is needed,
>
>
> Or maybe your analysis is wrong?

I'm well aware of the fact that wbinvd is problematic for hypervisors
and is an attack vector for a guest to DOS the host.

>
>
> > but
> > it does not make the description of this patch correct.
>
> If KVM was broken I'm sure we would hear about it.

KVM does not try to support the cases where wbinvd being unavailable
would break the system. That is not the claim being made in this
patch.

> The ACPI cases are for S3, which is not supported in guests, or for the
> old style manual IO port C6, which isn't supported either.

> The persistent memory cases would require working DMA mappings,

> No, that analysis is wrong. The wbinvd audit would have found that
persistent memory secure-erase and unlock, which has nothing to do
with DMA, needs wbinvd to ensure that the CPU has not retained a copy
of the PMEM contents from before the unlock happened and it needs to
make sure that any data that was meant to be destroyed by an erasure
is not retained in cache.

> which we
> currently don't support. If DMA mappings were added we would need to
> para virtualized WBINVD, like the comments say.
>
> AFAIK all the rest is for some caching attribute change, which is not
> possible in KVM (because it uses EPT.IgnorePAT=1) nor in TDX (which does
> the same). Some are for MTRR which is completely disabled if you're
> running under EPT.

It's fine to not support the above cases, I am asking for the
explanation to demonstrate the known risks and the known mitigations.
IgnorePAT is not the mitigation, the mitigation is an audit to
describe why the known users are unlikely to be triggered. Even better
would be an addition patch that does something like:

diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index 4b80150e4afa..a6b13a1ae319 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -170,6 +170,9 @@ static int __nvdimm_security_unlock(struct nvdimm *nvdimm)
        const void *data;
        int rc;

+       if (is_protected_guest())
+               return -ENXIO;
+
        /* The bus lock should be held at the top level of the call stack */
        lockdep_assert_held(&nvdimm_bus->reconfig_mutex);

...to explicitly error out a wbinvd use case before data is altered
and wbinvd is needed.

^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-25  1:45                                   ` Dan Williams
@ 2021-05-25  2:13                                     ` Andi Kleen
  2021-05-25  2:49                                       ` Dan Williams
  2021-05-25  4:32                                       ` [RFC v2-fix-v2 2/2] x86/tdx: Ignore " Dave Hansen
  0 siblings, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-25  2:13 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List


On 5/24/2021 6:45 PM, Dan Williams wrote:
>
>>
>>> but
>>> it does not make the description of this patch correct.
>> If KVM was broken I'm sure we would hear about it.
> KVM does not try to support the cases where wbinvd being unavailable
> would break the system. That is not the claim being made in this
> patch.

I thought we made that claim.


"We just want to be the same as KVM"

>
>> The ACPI cases are for S3, which is not supported in guests, or for the
>> old style manual IO port C6, which isn't supported either.
>> The persistent memory cases would require working DMA mappings,
> No, that analysis is wrong.The wbinvd audit would have found that
> persistent memory secure-erase and unlock, which has nothing to do
> with DMA, needs wbinvd to ensure that the CPU has not retained a copy
> of the PMEM contents from before the unlock happened and it needs to
> make sure that any data that was meant to be destroyed by an erasure
> is not retained in cache.

But that's all not supported in TDX.

And the only way it could work in KVM is when there is some DMA, likely 
at least an IOMMU, e.g. to set up the persistent memory. That's what I 
meant with working DMA mappings.

Otherwise KVM would be really broken, but I don't really believe that 
without some real evidence.


>
> It's fine to not support the above cases, I am asking for the
> explanation to demonstrate the known risks and the known mitigations.

The analysis is that all this stuff that you are worried about cannot be 
enabled in a TDX guest.

(it would be a nightmare if it could, we would need to actually make it 
secure against a malicious host)

> IgnorePAT is not the mitigation, the mitigation is an audit to
> describe why the known users are unlikely to be triggered. Even better
> would be an addition patch that does something like:
>
> diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
> index 4b80150e4afa..a6b13a1ae319 100644
> --- a/drivers/nvdimm/security.c
> +++ b/drivers/nvdimm/security.c
> @@ -170,6 +170,9 @@ static int __nvdimm_security_unlock(struct nvdimm *nvdimm)
>          const void *data;
>          int rc;
>
> +       if (is_protected_guest())
> +               return -ENXIO;
> +
>          /* The bus lock should be held at the top level of the call stack */
>          lockdep_assert_held(&nvdimm_bus->reconfig_mutex);
>
> ...to explicitly error out a wbinvd use case before data is altered
> and wbinvd is needed.

I don't see any point of all of this. We really just want to be the same 
as KVM. Not get into the business of patching a bazillion sub systems 
that cannot be used in TDX anyways.

-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/2] x86/tdx: Handle MWAIT and MONITOR
  2021-05-24 23:32                       ` [RFC v2-fix-v2 1/2] x86/tdx: Handle MWAIT and MONITOR Kuppuswamy Sathyanarayanan
  2021-05-24 23:32                         ` [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest Kuppuswamy Sathyanarayanan
@ 2021-05-25  2:26                         ` Dan Williams
  1 sibling, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-25  2:26 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Mon, May 24, 2021 at 4:32 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> When running as a TDX guest, there are a number of existing,
> privileged instructions that do not work. If the guest kernel
> uses these instructions, the hardware generates a #VE.
>
> You can find the list of unsupported instructions in Intel
> Trust Domain Extensions (Intel® TDX) Module specification,
> sec 9.2.2 and in Guest-Host Communication Interface (GHCI)
> Specification for Intel TDX, sec 2.4.1.
>
> To prevent TD guests from using MWAIT/MONITOR instructions,
> the CPUID flags for these instructions are already disabled
> by the TDX module.
>
> Despite the above preventive measures, if TD guests still
> execute these instructions, print an appropriate warning
> message (via WARN_ONCE()) in the #VE handler. This handling
> behavior is the same as KVM's, which also treats MWAIT/MONITOR
> as nops, warning once on unsupported platforms.
>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> ---
>
> Changes since RFC v2:
>  * Moved WBINVD related changes to a new patch.
>  * Fixed commit log as per review comments.

Looks good.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
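[Editor's note: the warn-once-and-skip behavior the patch describes can be sketched in userspace terms. The exit-reason values below are the VMX numbers for MWAIT/MONITOR, and warn_once() is a stand-in for the kernel's WARN_ONCE(); the real handler lives in arch/x86/kernel/tdx.c.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* VMX basic exit-reason values, reused by TDX for #VE (Intel SDM). */
#define EXIT_REASON_MWAIT_INSTRUCTION	36
#define EXIT_REASON_MONITOR_INSTRUCTION	39

static int warn_count;	/* exposed for the example; the kernel just logs */

/* Userspace stand-in for the kernel's WARN_ONCE(). */
static void warn_once(const char *msg)
{
	static bool warned;

	if (!warned) {
		warned = true;
		warn_count++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
}

/*
 * MWAIT/MONITOR are already hidden from the guest via CPUID by the
 * TDX module, so a #VE for them indicates a guest bug: warn once and
 * skip the instruction, i.e. treat it as a nop (matching KVM).
 */
static int handle_mwait_monitor_ve(unsigned long exit_reason)
{
	switch (exit_reason) {
	case EXIT_REASON_MWAIT_INSTRUCTION:
	case EXIT_REASON_MONITOR_INSTRUCTION:
		warn_once("MWAIT/MONITOR unsupported in TDX guest");
		return 0;	/* handled: advance RIP past the insn */
	}
	return -1;		/* not ours */
}
```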

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-25  2:13                                     ` Andi Kleen
@ 2021-05-25  2:49                                       ` Dan Williams
  2021-05-25  3:27                                         ` Andi Kleen
  2021-05-25  4:32                                       ` [RFC v2-fix-v2 2/2] x86/tdx: Ignore " Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-25  2:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Mon, May 24, 2021 at 7:13 PM Andi Kleen <ak@linux.intel.com> wrote:
[..]
> > ...to explicitly error out a wbinvd use case before data is altered
> > and wbinvd is needed.
>
> I don't see any point of all of this. We really just want to be the same
> as KVM. Not get into the business of patching a bazillion sub systems
> that cannot be used in TDX anyways.

Please let's not start this patch off with dubious claims of safety
afforded by IgnorePAT. Instead make the true argument that wbinvd is
known to be problematic in guests and for that reason many bare metal
use cases that require wbinvd have not been ported to guests (like
PMEM unlock), and others that only use wbinvd to opportunistically
enforce a cache state (like ACPI sleep states) do not see ill effects
from missing wbinvd. Given KVM ships with a policy to elide wbinvd in
many scenarios adopt the same policy for TDX guests.
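[Editor's note: the KVM elision policy referenced here can be modeled roughly as follows. KVM tracks assigned noncoherent DMA (via kvm_arch_register_noncoherent_dma()) and only performs a real flush for guest WBINVD when that count is nonzero; this is an illustrative simplification, not KVM's exact code.]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of KVM's per-VM noncoherent-DMA refcount. */
static int noncoherent_dma_count;

static void register_noncoherent_dma(void)   { noncoherent_dma_count++; }
static void unregister_noncoherent_dma(void) { noncoherent_dma_count--; }

/*
 * A guest WBINVD only needs real cache flushing when a device with
 * noncoherent DMA is assigned; otherwise guest memory is reached only
 * through coherent paths and the flush can be elided as a nop.
 */
static bool emulate_wbinvd(void)
{
	if (noncoherent_dma_count == 0)
		return false;	/* elided */
	/* here KVM would flush caches on the CPUs the vCPU ran on */
	return true;
}
```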

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-25  2:49                                       ` Dan Williams
@ 2021-05-25  3:27                                         ` Andi Kleen
  2021-05-25  3:40                                           ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-25  3:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List


On 5/24/2021 7:49 PM, Dan Williams wrote:
> On Mon, May 24, 2021 at 7:13 PM Andi Kleen <ak@linux.intel.com> wrote:
> [..]
>>> ...to explicitly error out a wbinvd use case before data is altered
>>> and wbinvd is needed.
>> I don't see any point of all of this. We really just want to be the same
>> as KVM. Not get into the business of patching a bazillion sub systems
>> that cannot be used in TDX anyways.
> Please let's not start this patch off with dubious claims of safety
> afforded by IgnorePAT. Instead make the true argument that wbinvd is
> known to be problematic in guests

That's just another reason to not support WBINVD, but I don't think it's 
the main reason. The main reason is that it is simply not needed, unless 
you do DMA in some form.

(and yes I consider direct mapping of persistent memory with a complex 
setup procedure a form of DMA -- my guess is that the reason that it 
works in KVM is that it somehow activates the DMA code paths in KVM)

IMNSHO that's the true reason.

> and for that reason many bare metal
> use cases that require wbinvd have not been ported to guests (like
> PMEM unlock), and others that only use wbinvd to opportunistically
> enforce a cache state (like ACPI sleep states)

ACPI sleep states are not supported or needed in virtualization. They 
are mostly obsolete on real hardware too.


> do not see ill effects
> from missing wbinvd. Given KVM ships with a policy to elide wbinvd in
> many scenarios adopt the same policy for TDX guests.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-25  3:27                                         ` Andi Kleen
@ 2021-05-25  3:40                                           ` Dan Williams
  2021-05-26  1:09                                             ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-25  3:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Mon, May 24, 2021 at 8:27 PM Andi Kleen <ak@linux.intel.com> wrote:
>
>
> On 5/24/2021 7:49 PM, Dan Williams wrote:
> > On Mon, May 24, 2021 at 7:13 PM Andi Kleen <ak@linux.intel.com> wrote:
> > [..]
> >>> ...to explicitly error out a wbinvd use case before data is altered
> >>> and wbinvd is needed.
> >> I don't see any point of all of this. We really just want to be the same
> >> as KVM. Not get into the business of patching a bazillion sub systems
> >> that cannot be used in TDX anyways.
> > Please let's not start this patch off with dubious claims of safety
> > afforded by IgnorePAT. Instead make the true argument that wbinvd is
> > known to be problematic in guests
>
> That's just another reason to not support WBINVD, but I don't think it's
> the main reason. The main reason is that it is simply not needed, unless
> you do DMA in some form.
>
> (and yes I consider direct mapping of persistent memory with a complex
> setup procedure a form of DMA -- my guess is that the reason that it
> works in KVM is that it somehow activates the DMA code paths in KVM)

No, it doesn't. Simply no one has tried to pass through the security
interface of a bare metal nvdimm to a guest, or enabled the security
commands in a virtualized nvdimm. If a guest supports a memory map it
supports PMEM; I struggle to see DMA anywhere in that equation.

>
> IMNSHO that's the true reason.

I do see why it would be attractive if IgnorePAT were a solid signal to
ditch wbinvd support. However, it simply isn't, and to date nothing
has cared to trip over that gap.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-25  2:13                                     ` Andi Kleen
  2021-05-25  2:49                                       ` Dan Williams
@ 2021-05-25  4:32                                       ` Dave Hansen
  1 sibling, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-25  4:32 UTC (permalink / raw)
  To: Andi Kleen, Dan Williams
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On 5/24/21 7:13 PM, Andi Kleen wrote:
> I don't see any point of all of this. We really just want to be the same
> as KVM. Not get into the business of patching a bazillion sub systems
> that cannot be used in TDX anyways.

Andi, there's a fundamental difference between KVM the hypervisor and a
TDX guest: KVM the hypervisor runs unknown guests, and lots of them.

TD guest support as a whole has to handle one thing: running *one* Linux
kernel.  Further, the guest support shares a source tree with that
kernel.  TD guest support doesn't have to run random binaries for which
there is no source.  All of the source is *RIGHT* *THERE*.

The only reason TD guest support would have to fall back to KVM's dirty
tricks is a desire to treat the rest of the kernel like a black box.
KVM frankly has no other choice.  TD guest support has all the choices
in the world.


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-21 21:14                       ` Tom Lendacky
@ 2021-05-25 18:21                         ` Kuppuswamy, Sathyanarayanan
  2021-05-31 15:13                           ` Borislav Petkov
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-25 18:21 UTC (permalink / raw)
  To: Tom Lendacky, Borislav Petkov
  Cc: Sean Christopherson, Dave Hansen, Andi Kleen, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel,
	Brijesh Singh

Hi,

On 5/21/21 2:14 PM, Tom Lendacky wrote:
> 
> 
> On 5/21/21 1:49 PM, Borislav Petkov wrote:
>> On Fri, May 21, 2021 at 11:19:15AM -0500, Tom Lendacky wrote:
>>> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
>>> when SEV support was added), we do:
>>> 	if (sev_active())
>>> 		swiotlb_force = SWIOTLB_FORCE;
>>>
>>> TDX should be able to do a similar thing without having to touch
>>> arch/x86/kernel/pci-swiotlb.c.
>>>
>>> That would remove any confusion over SME being part of a
>>> protected_guest_has() call.
>>
>> Even better.
>>
>>> I kinda like the separate function, though.
>>
>> Only if you clean it up and get rid of the inverted logic and drop that
>> silly switch-case.
>>
>>> Except mem_encrypt_active() covers both SME and SEV, so
>>> protected_guest_has() would be confusing.
>>
>> I don't understand - the AMD-specific function amd_protected_guest_has()
>> would return sme_me_mask just like mem_encrypt_active() does and we can
>> get rid of latter.
>>
>> Or do you have a problem with the name protected_guest_has() containing
>> "guest" while we're talking about SME here?
> 
> The latter.
> 
>>
>> If so, feel free to suggest a better one - the name does not have to
>> have "guest" in it.
> 
> Let me see if I can come up with something that will make sense.
> 
> Thanks,
> Tom
> 
>>
>> Thx.
>>
>>

Following is the sample implementation. Please let me know your
comments.

     tdx: Introduce generic protected_guest abstraction

     Add a generic way to check whether we are running as an encrypted
     guest, without requiring x86-specific ifdefs. This can then be
     used in non-architecture-specific code.

     The is_protected_guest() helper function can be implemented using
     arch-specific CPU feature flags.

     protected_guest_has() is used to check for protected guest
     feature flags.

     Originally-by: Andi Kleen <ak@linux.intel.com>
     Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

diff --git a/arch/Kconfig b/arch/Kconfig
index ecfd3520b676..98c30312555b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -956,6 +956,9 @@ config HAVE_ARCH_NVRAM_OPS
  config ISA_BUS_API
         def_bool ISA

+config ARCH_HAS_PROTECTED_GUEST
+       bool
+
  #
  # ABI hall of shame
  #
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bc91c4aa7ce4..2f31613be965 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -879,6 +879,7 @@ config INTEL_TDX_GUEST
         select X86_X2APIC
         select SECURITY_LOCKDOWN_LSM
         select X86_MEM_ENCRYPT_COMMON
+       select ARCH_HAS_PROTECTED_GUEST
         help
           Provide support for running in a trusted domain on Intel processors
          equipped with Trusted Domain eXtensions. TDX is a new Intel
diff --git a/arch/x86/include/asm/protected_guest.h b/arch/x86/include/asm/protected_guest.h
new file mode 100644
index 000000000000..b2838e58ce94
--- /dev/null
+++ b/arch/x86/include/asm/protected_guest.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_PROTECTED_GUEST
+#define _ASM_PROTECTED_GUEST 1
+
+#include <asm/cpufeature.h>
+#include <asm/tdx.h>
+
+/* Only include through linux/protected_guest.h */
+
+static inline bool is_protected_guest(void)
+{
+       return boot_cpu_has(X86_FEATURE_TDX_GUEST);
+}
+
+static inline bool protected_guest_has(unsigned long flag)
+{
+       if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
+               return tdx_protected_guest_has(flag);
+
+       return false;
+}
+
+#endif
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 175cebb7bf94..d894111f49ea 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -147,6 +147,7 @@ do {                                                                        \
  extern phys_addr_t tdg_shared_mask(void);
  extern int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
                                 enum tdx_map_type map_type);
+bool tdx_protected_guest_has(unsigned long flag);

  #else // !CONFIG_INTEL_TDX_GUEST

@@ -167,6 +168,11 @@ static inline int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
  {
         return -ENODEV;
  }
+
+static inline bool tdx_protected_guest_has(unsigned long flag)
+{
+       return false;
+}
  #endif /* CONFIG_INTEL_TDX_GUEST */

  #ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index c613c89d0d6a..cbb893412b43 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -11,6 +11,7 @@
  #include <linux/sched/signal.h> /* force_sig_fault() */
  #include <linux/swiotlb.h>
  #include <linux/security.h>
+#include <linux/protected_guest.h>

  #include <linux/cpu.h>

@@ -122,6 +123,23 @@ bool is_tdx_guest(void)
  }
  EXPORT_SYMBOL_GPL(is_tdx_guest);

+bool tdx_protected_guest_has(unsigned long flag)
+{
+       if (!is_tdx_guest())
+               return false;
+
+       switch (flag) {
+       case VM_MEM_ENCRYPT:
+       case VM_MEM_ENCRYPT_ACTIVE:
+       case VM_UNROLL_STRING_IO:
+       case VM_HOST_MEM_ENCRYPT:
+               return true;
+       }
+
+       return false;
+}
+EXPORT_SYMBOL_GPL(tdx_protected_guest_has);
+
  /* The highest bit of a guest physical address is the "sharing" bit */
  phys_addr_t tdg_shared_mask(void)
  {
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
new file mode 100644
index 000000000000..f362eea39bd8
--- /dev/null
+++ b/include/linux/protected_guest.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_PROTECTED_GUEST_H
+#define _LINUX_PROTECTED_GUEST_H 1
+
+/* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */
+
+/* Support for guest encryption */
+#define VM_MEM_ENCRYPT                 0x100
+/* Encryption support is active */
+#define VM_MEM_ENCRYPT_ACTIVE          0x101
+/* Support for unrolled string IO */
+#define VM_UNROLL_STRING_IO            0x102
+/* Support for host memory encryption */
+#define VM_HOST_MEM_ENCRYPT            0x103
+
+#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
+#include <asm/protected_guest.h>
+#else
+static inline bool is_protected_guest(void) { return false; }
+static inline bool protected_guest_has(unsigned long flag) { return false; }
+#endif
+
+#endif
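
[Editor's note: to illustrate how the proposed abstraction would be consumed, here is a userspace model of the patch above: tdx_protected_guest_has() plus a hypothetical generic caller. The flag values mirror the proposed linux/protected_guest.h; the boot_cpu_has(X86_FEATURE_TDX_GUEST) check is replaced by a plain boolean for the sketch.]

```c
#include <assert.h>
#include <stdbool.h>

/* Flags mirrored from the proposed linux/protected_guest.h. */
#define VM_MEM_ENCRYPT		0x100
#define VM_MEM_ENCRYPT_ACTIVE	0x101
#define VM_UNROLL_STRING_IO	0x102
#define VM_HOST_MEM_ENCRYPT	0x103

/* Stand-in for boot_cpu_has(X86_FEATURE_TDX_GUEST). */
static bool tdx_guest = true;

/* Mirrors tdx_protected_guest_has() from the patch above. */
static bool protected_guest_has(unsigned long flag)
{
	if (!tdx_guest)
		return false;

	switch (flag) {
	case VM_MEM_ENCRYPT:
	case VM_MEM_ENCRYPT_ACTIVE:
	case VM_UNROLL_STRING_IO:
	case VM_HOST_MEM_ENCRYPT:
		return true;
	}
	return false;
}

/*
 * Hypothetical generic-code consumer: e.g. deciding whether DMA must
 * be bounced through SWIOTLB, without any x86 ifdefs at the call site.
 */
static bool should_force_swiotlb(void)
{
	return protected_guest_has(VM_MEM_ENCRYPT_ACTIVE);
}
```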


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-25  3:40                                           ` Dan Williams
@ 2021-05-26  1:09                                             ` Andi Kleen
  2021-05-27  4:38                                               ` [RFC v2-fix-v3 1/1] " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-26  1:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List


On 5/24/2021 8:40 PM, Dan Williams wrote:
> On Mon, May 24, 2021 at 8:27 PM Andi Kleen <ak@linux.intel.com> wrote:
>>
>> On 5/24/2021 7:49 PM, Dan Williams wrote:
>>> On Mon, May 24, 2021 at 7:13 PM Andi Kleen <ak@linux.intel.com> wrote:
>>> [..]
>>>>> ...to explicitly error out a wbinvd use case before data is altered
>>>>> and wbinvd is needed.
>>>> I don't see any point of all of this. We really just want to be the same
>>>> as KVM. Not get into the business of patching a bazillion sub systems
>>>> that cannot be used in TDX anyways.
>>> Please let's not start this patch off with dubious claims of safety
>>> afforded by IgnorePAT. Instead make the true argument that wbinvd is
>>> known to be problematic in guests
>> That's just another reason to not support WBINVD, but I don't think it's
>> the main reason. The main reason is that it is simply not needed, unless
>> you do DMA in some form.
>>
>> (and yes I consider direct mapping of persistent memory with a complex
>> setup procedure a form of DMA -- my guess is that the reason that it
>> works in KVM is that it somehow activates the DMA code paths in KVM)
> No, it doesn't. Simply no one has tried to pass through the security
> interface of bare metal nvdimm to a guest, or enabled the security
> commands in a virtualized nvdimm.

Maybe a better term would be "external side effects": something in 
the I/O domain that can notice a difference.

> If a guest supports a memory map it supports PMEM I struggle to see DMA anywhere in that equation.

Okay, if that happens to a TDX guest we have to start emulating 
WBINVD. But right now we don't need it.

I guess we can add a comment that says

"if someone wants to implement NVDIMM secure delete they would also need 
to implement this new hypercall"

>
>> IMNSHO that's the true reason.
> I do see why it would be attractive if IgnorePAT was a solid signal to
> ditch wbinvd support. However, it simply isn't, and to date nothing
> has cared trip over that gap.


I think we're getting into angels on a pinhead here.

The key point is that current TDX does not need WBINVD. I believe we 
agree on that.


-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-21 16:19                   ` Tom Lendacky
  2021-05-21 18:49                     ` Borislav Petkov
@ 2021-05-26 21:37                     ` Kuppuswamy, Sathyanarayanan
  2021-05-26 22:02                       ` Tom Lendacky
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-26 21:37 UTC (permalink / raw)
  To: Tom Lendacky, Borislav Petkov
  Cc: Sean Christopherson, Dave Hansen, Andi Kleen, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel,
	Brijesh Singh



On 5/21/21 9:19 AM, Tom Lendacky wrote:
> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
> when SEV support was added), we do:
> 	if (sev_active())
> 		swiotlb_force = SWIOTLB_FORCE;
> 
> TDX should be able to do a similar thing without having to touch
> arch/x86/kernel/pci-swiotlb.c.
> 
> That would remove any confusion over SME being part of a
> protected_guest_has() call.

You mean sme_active() check in arch/x86/kernel/pci-swiotlb.c is redundant?

  41 int __init pci_swiotlb_detect_4gb(void)
  42 {
  43         /* don't initialize swiotlb if iommu=off (no_iommu=1) */
  44         if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
  45                 swiotlb = 1;
  46
  47         /*
  48          * If SME is active then swiotlb will be set to 1 so that bounce
  49          * buffers are allocated and used for devices that do not support
  50          * the addressing range required for the encryption mask.
  51          */
  52         if (sme_active() || is_tdx_guest())
  53                 swiotlb = 1;


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-26 21:37                     ` Kuppuswamy, Sathyanarayanan
@ 2021-05-26 22:02                       ` Tom Lendacky
  2021-05-26 22:14                         ` Tom Lendacky
  0 siblings, 1 reply; 381+ messages in thread
From: Tom Lendacky @ 2021-05-26 22:02 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Borislav Petkov
  Cc: Sean Christopherson, Dave Hansen, Andi Kleen, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel,
	Brijesh Singh



On 5/26/21 4:37 PM, Kuppuswamy, Sathyanarayanan wrote:
> 
> 
> On 5/21/21 9:19 AM, Tom Lendacky wrote:
>> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
>> when SEV support was added), we do:
>>     if (sev_active())
>>         swiotlb_force = SWIOTLB_FORCE;
>>
>> TDX should be able to do a similar thing without having to touch
>> arch/x86/kernel/pci-swiotlb.c.
>>
>> That would remove any confusion over SME being part of a
>> protected_guest_has() call.
> 
> You mean sme_active() check in arch/x86/kernel/pci-swiotlb.c is redundant?

No, the sme_active() check is required to make sure that SWIOTLB is
available under SME. Encrypted DMA is supported under SME if the device
supports 64-bit DMA. But if the device doesn't support 64-bit DMA and the
IOMMU is not active, then DMA will be bounced through SWIOTLB.

As compared to SEV, where all DMA has to be bounced through SWIOTLB or
unencrypted memory. For that, swiotlb_force is used.
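
[Editor's note: the distinction can be summarized in a small model. SME merely makes SWIOTLB available for devices that cannot address the encryption mask, while SEV, and analogously TDX, must force all DMA through it. This mirrors arch/x86/mm/mem_encrypt.c and pci-swiotlb.c in spirit only; the flags are stand-ins for sme_active()/sev_active()/is_tdx_guest().]

```c
#include <assert.h>
#include <stdbool.h>

enum swiotlb_mode { SWIOTLB_NORMAL, SWIOTLB_FORCE };

static enum swiotlb_mode swiotlb_force = SWIOTLB_NORMAL;
static int swiotlb;	/* 1 = bounce buffers allocated */

static bool sme_active_flag, sev_active_flag, tdx_guest_flag;

static void setup_swiotlb_for_mem_encrypt(void)
{
	/*
	 * SME (bare metal host): keep SWIOTLB available so devices
	 * without 64-bit DMA support can bounce; 64-bit-capable
	 * devices still DMA to encrypted memory directly.
	 */
	if (sme_active_flag)
		swiotlb = 1;

	/*
	 * SEV guest (and TDX guest, analogously): the hypervisor can
	 * never see private memory, so all DMA must bounce through
	 * shared/unencrypted SWIOTLB buffers.
	 */
	if (sev_active_flag || tdx_guest_flag) {
		swiotlb = 1;
		swiotlb_force = SWIOTLB_FORCE;
	}
}
```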

Thanks,
Tom

> 
>  41 int __init pci_swiotlb_detect_4gb(void)
>  42 {
>  43         /* don't initialize swiotlb if iommu=off (no_iommu=1) */
>  44         if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
>  45                 swiotlb = 1;
>  46
>  47         /*
>  48          * If SME is active then swiotlb will be set to 1 so that bounce
>  49          * buffers are allocated and used for devices that do not support
>  50          * the addressing range required for the encryption mask.
>  51          */
>  52         if (sme_active() || is_tdx_guest())
>  53                 swiotlb = 1;
> 
> 

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-26 22:02                       ` Tom Lendacky
@ 2021-05-26 22:14                         ` Tom Lendacky
  2021-05-26 22:20                           ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Tom Lendacky @ 2021-05-26 22:14 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Borislav Petkov
  Cc: Sean Christopherson, Dave Hansen, Andi Kleen, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel,
	Brijesh Singh

On 5/26/21 5:02 PM, Tom Lendacky wrote:
> On 5/26/21 4:37 PM, Kuppuswamy, Sathyanarayanan wrote:
>>
>>
>> On 5/21/21 9:19 AM, Tom Lendacky wrote:
>>> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
>>> when SEV support was added), we do:
>>>     if (sev_active())
>>>         swiotlb_force = SWIOTLB_FORCE;
>>>
>>> TDX should be able to do a similar thing without having to touch
>>> arch/x86/kernel/pci-swiotlb.c.
>>>
>>> That would remove any confusion over SME being part of a
>>> protected_guest_has() call.
>>
>> You mean sme_active() check in arch/x86/kernel/pci-swiotlb.c is redundant?
> 
> No, the sme_active() check is required to make sure that SWIOTLB is
> available under SME. Encrypted DMA is supported under SME if the device
> supports 64-bit DMA. But if the device doesn't support 64-bit DMA and the
> IOMMU is not active, then DMA will be bounced through SWIOTLB.
> 
> As compared to SEV, where all DMA has to be bounced through SWIOTLB or
> unencrypted memory. For that, swiotlb_force is used.

I should probably add that SME is memory encryption support for
host/hypervisor/bare-metal, while SEV is memory encryption support for
virtualization.

Thanks,
Tom

> 
> Thanks,
> Tom
> 
>>
>>  41 int __init pci_swiotlb_detect_4gb(void)
>>  42 {
>>  43         /* don't initialize swiotlb if iommu=off (no_iommu=1) */
>>  44         if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
>>  45                 swiotlb = 1;
>>  46
>>  47         /*
>>  48          * If SME is active then swiotlb will be set to 1 so that bounce
>>  49          * buffers are allocated and used for devices that do not support
>>  50          * the addressing range required for the encryption mask.
>>  51          */
>>  52         if (sme_active() || is_tdx_guest())
>>  53                 swiotlb = 1;
>>
>>

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-26 22:14                         ` Tom Lendacky
@ 2021-05-26 22:20                           ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-26 22:20 UTC (permalink / raw)
  To: Tom Lendacky, Borislav Petkov
  Cc: Sean Christopherson, Dave Hansen, Andi Kleen, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel,
	Brijesh Singh



On 5/26/21 3:14 PM, Tom Lendacky wrote:
> On 5/26/21 5:02 PM, Tom Lendacky wrote:
>> On 5/26/21 4:37 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>
>>>
>>> On 5/21/21 9:19 AM, Tom Lendacky wrote:
>>>> In arch/x86/mm/mem_encrypt.c, sme_early_init() (should have renamed that
>>>> when SEV support was added), we do:
>>>>      if (sev_active())
>>>>          swiotlb_force = SWIOTLB_FORCE;
>>>>
>>>> TDX should be able to do a similar thing without having to touch
>>>> arch/x86/kernel/pci-swiotlb.c.
>>>>
>>>> That would remove any confusion over SME being part of a
>>>> protected_guest_has() call.
>>>
>>> You mean sme_active() check in arch/x86/kernel/pci-swiotlb.c is redundant?
>>
>> No, the sme_active() check is required to make sure that SWIOTLB is
>> available under SME. Encrypted DMA is supported under SME if the device
>> supports 64-bit DMA. But if the device doesn't support 64-bit DMA and the
>> IOMMU is not active, then DMA will be bounced through SWIOTLB.
>>
>> As compared to SEV, where all DMA has to be bounced through SWIOTLB or
>> unencrypted memory. For that, swiotlb_force is used.
> 
> I should probably add that SME is memory encryption support for
> host/hypervisor/bare-metal, while SEV is memory encryption support for
> virtualization.

Got it. Thanks for clarification.

> 
> Thanks,
> Tom
> 
>>
>> Thanks,
>> Tom
>>
>>>
>>>   41 int __init pci_swiotlb_detect_4gb(void)
>>>   42 {
>>>   43         /* don't initialize swiotlb if iommu=off (no_iommu=1) */
>>>   44         if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
>>>   45                 swiotlb = 1;
>>>   46
>>>   47         /*
>>>   48          * If SME is active then swiotlb will be set to 1 so that bounce
>>>   49          * buffers are allocated and used for devices that do not support
>>>   50          * the addressing range required for the encryption mask.
>>>   51          */
>>>   52         if (sme_active() || is_tdx_guest())
>>>   53                 swiotlb = 1;
>>>
>>>

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-24 14:02             ` Andi Kleen
@ 2021-05-27  0:29               ` Kuppuswamy Sathyanarayanan
  2021-05-27 15:11                 ` Luck, Tony
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-27  0:29 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson,
	Kuppuswamy Sathyanarayanan, linux-kernel

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the kernel:

 * Specific instructions (WBINVD, for example)
 * Specific MSR accesses
 * Specific CPUID leaf accesses
 * Access to TD-shared memory, which includes MMIO

In the settings that Linux will run in, virtualization exceptions are never
generated on accesses to normal, TD-private memory that has been
accepted.

The entry paths do not access TD-shared memory, MMIO regions or use
those specific MSRs, instructions, CPUID leaves that might generate #VE.
In addition, all interrupts including NMIs are blocked by the hardware
starting with #VE delivery until TDGETVEINFO is called.  This eliminates
the chance of a #VE during the syscall gap or paranoid entry paths and
simplifies #VE handling.

After TDGETVEINFO, a #VE could happen in theory (e.g. through an NMI),
but it is not expected to, because TDX expects NMIs not to trigger
#VEs. Another case where one could happen is if the #VE handler
itself panics, but in that case there are no guarantees on anything
anyway.

If a guest kernel action which would normally cause a #VE occurs in the
interrupt-disabled region before TDGETVEINFO, a #DF is delivered to the
guest which will result in an oops (and should eventually be a panic, as
we would like to set panic_on_oops to 1 for TDX guests).

Add basic infrastructure to handle any #VE which occurs in the kernel or
userspace.  Later patches will add handling for specific #VE scenarios.

Convert unhandled #VEs (everything, until later in this series) so
that they appear just like a #GP, by calling ve_raise_fault()
directly. ve_raise_fault() is similar to the #GP handler: it is
responsible for sending SIGSEGV to userspace, calling die() for
kernel-mode faults, and notifying debuggers and other die-chain users.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix:
 * No code changes (Added Tony to "To" list)

Changes since v1:
 * Removed [RFC v2 07/32] x86/traps: Add do_general_protection() helper function.
 * Instead of reusing the #GP handler, defined a custom handler.
 * Fixed commit log as per review comments.

 arch/x86/include/asm/idtentry.h |  4 ++
 arch/x86/include/asm/tdx.h      | 19 +++++++++
 arch/x86/kernel/idt.c           |  6 +++
 arch/x86/kernel/tdx.c           | 36 +++++++++++++++++
 arch/x86/kernel/traps.c         | 69 +++++++++++++++++++++++++++++++++
 5 files changed, 134 insertions(+)
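
[Editor's note: the commit message's description of ve_raise_fault() can be sketched in userspace terms. The helper names (force_sig(), die_addr()) echo real kernel APIs but the bodies here are mocks; the authoritative version is the traps.c hunk of this patch, and debugger/die-chain notification is omitted from the sketch.]

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Minimal stand-ins for kernel facilities (illustrative only). */
struct pt_regs { unsigned long ip; int from_user; };

static int last_signal;
static bool kernel_died;

static bool user_mode(const struct pt_regs *regs) { return regs->from_user; }
static void force_sig(int sig) { last_signal = sig; }
static void die_addr(const char *msg, const struct pt_regs *regs)
{
	(void)msg; (void)regs;
	kernel_died = true;	/* the kernel would oops here */
}

/*
 * Sketch of ve_raise_fault() as described: behave like the #GP
 * handler. A userspace #VE becomes SIGSEGV; an unhandled kernel-mode
 * #VE goes through the die chain (oops).
 */
static void ve_raise_fault(struct pt_regs *regs, long error_code)
{
	(void)error_code;

	if (user_mode(regs)) {
		force_sig(SIGSEGV);
		return;
	}

	die_addr("virtualization exception", regs);
}
```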

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5eb3bdf36a41..41a0732d5f68 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -619,6 +619,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER,	exc_xen_hypervisor_callback);
 DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER,	exc_xen_unknown_trap);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE,		exc_virtualization_exception);
+#endif
+
 /* Device interrupts common/spurious */
 DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index fcd42119a287..a451786496a0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -39,6 +39,25 @@ struct tdx_hypercall_output {
 	u64 r15;
 };
 
+/*
+ * Used by #VE exception handler to gather the #VE exception
+ * info from the TDX module. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct ve_info {
+	u64 exit_reason;
+	u64 exit_qual;
+	u64 gla;
+	u64 gpa;
+	u32 instr_len;
+	u32 instr_info;
+};
+
+unsigned long tdg_get_ve_info(struct ve_info *ve);
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+		struct ve_info *ve);
+
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..546b6b636c7d 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
 	 */
 	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 #endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 };
 
 /*
@@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
 	INTG(X86_TRAP_MF,		asm_exc_coprocessor_error),
 	INTG(X86_TRAP_AC,		asm_exc_alignment_check),
 	INTG(X86_TRAP_XF,		asm_exc_simd_coprocessor_error),
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 
 #ifdef CONFIG_X86_32
 	TSKG(X86_TRAP_DF,		GDT_ENTRY_DOUBLEFAULT_TSS),
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e4383b416ef3..527d2638ddae 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -10,6 +10,7 @@
 
 /* TDX Module call Leaf IDs */
 #define TDINFO				1
+#define TDGETVEINFO			3
 
 static struct {
 	unsigned int gpa_width;
@@ -87,6 +88,41 @@ static void tdg_get_info(void)
 	td_info.attributes = out.rdx;
 }
 
+unsigned long tdg_get_ve_info(struct ve_info *ve)
+{
+	u64 ret;
+	struct tdx_module_output out = {0};
+
+	/*
+	 * NMIs and machine checks are suppressed. Before this point, any
+	 * #VE is fatal. After this point (the TDGETVEINFO call), NMIs and
+	 * additional #VEs are permitted (but they are not expected to
+	 * happen unless the kernel is about to panic anyway).
+	 */
+	ret = __tdx_module_call(TDGETVEINFO, 0, 0, 0, 0, &out);
+
+	ve->exit_reason = out.rcx;
+	ve->exit_qual   = out.rdx;
+	ve->gla         = out.r8;
+	ve->gpa         = out.r9;
+	ve->instr_len   = out.r10 & UINT_MAX;
+	ve->instr_info  = out.r10 >> 32;
+
+	return ret;
+}
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+		struct ve_info *ve)
+{
+	/*
+	 * TODO: Add handler support for various #VE exit
+	 * reasons. It will be added by other patches in
+	 * the series.
+	 */
+	pr_warn("Unexpected #VE: %llx\n", ve->exit_reason);
+	return -EFAULT;
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 651e3e508959..043608943c3b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/vdso.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
@@ -1137,6 +1138,74 @@ DEFINE_IDTENTRY(exc_device_not_available)
 	}
 }
 
+#define VEFSTR "VE fault"
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+	struct task_struct *tsk = current;
+
+	if (user_mode(regs)) {
+		tsk->thread.error_code = error_code;
+		tsk->thread.trap_nr = X86_TRAP_VE;
+
+		/*
+		 * Not fixing up VDSO exceptions similar to #GP handler
+		 * because we don't expect the VDSO to trigger #VE.
+		 */
+		show_signal(tsk, SIGSEGV, "", VEFSTR, regs, error_code);
+		force_sig(SIGSEGV);
+		return;
+	}
+
+	if (fixup_exception(regs, X86_TRAP_VE, error_code, 0))
+		return;
+
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_VE;
+
+	/*
+	 * To be potentially processing a kprobe fault and to trust the result
+	 * from kprobe_running(), we have to be non-preemptible.
+	 */
+	if (!preemptible() &&
+	    kprobe_running() &&
+	    kprobe_fault_handler(regs, X86_TRAP_VE))
+		return;
+
+	notify_die(DIE_GPF, VEFSTR, regs, error_code, X86_TRAP_VE, SIGSEGV);
+
+	die_addr(VEFSTR, regs, error_code, 0);
+}
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+	struct ve_info ve;
+	int ret;
+
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
+	/*
+	 * NMIs/machine checks/interrupts will be in a disabled state
+	 * until the TDGETVEINFO TDCALL is executed. This prevents #VE
+	 * nesting issues.
+	 */
+	ret = tdg_get_ve_info(&ve);
+
+	cond_local_irq_enable(regs);
+
+	if (!ret)
+		ret = tdg_handle_virtualization_exception(regs, &ve);
+	/*
+	 * If tdg_handle_virtualization_exception() could not process
+	 * it successfully, treat it as #GP(0) and handle it.
+	 */
+	if (ret)
+		ve_raise_fault(regs, 0);
+
+	cond_local_irq_disable(regs);
+}
+#endif
+
 #ifdef CONFIG_X86_32
 DEFINE_IDTENTRY_SW(iret_error)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-05-19 20:49                         ` Andi Kleen
@ 2021-05-27  0:30                           ` Kuppuswamy Sathyanarayanan
  2021-05-27 15:25                             ` Luck, Tony
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-27  0:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson,
	Kuppuswamy Sathyanarayanan, linux-kernel

Guests communicate with VMMs via hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs,
like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
expose guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called "tdcall".

In a TDX-based VM, since the VMM is an untrusted entity, an
intermediary layer (the TDX module) exists between host and guest
to facilitate secure communication. TDX guests communicate with
the TDX module, and with the VMM, using a new instruction: TDCALL.

Implement common helper functions to communicate with the TDX Module
and VMM (using TDCALL instruction).
   
__tdx_hypercall()    - request services from the VMM.
__tdx_module_call()  - communicate with the TDX Module.

Also define two additional wrappers, tdx_hypercall() and
tdx_hypercall_out_r11(), to cover common use cases of the
__tdx_hypercall() function. Since each use case of
__tdx_module_call() is different, it does not need
multiple wrappers.

Implement __tdx_module_call() and __tdx_hypercall() helper functions
in assembly.

The rationale for choosing assembly over inline assembly is
readability: the __tdx_hypercall() implementation is over 70
lines of instructions (with comments), and implementing it in
inline assembly would make it hard to read.
   
Also, just like syscalls, not all TDVMCALL/TDCALL use cases need to
use the same set of argument registers. The implementation here picks
the current worst-case scenario for TDCALL (4 registers). For TDCALLs
with fewer than 4 arguments, there will be a few superfluous
(cheap) instructions, but this approach maximizes code reuse. The
same argument applies to the __tdx_hypercall() function as well.

For the registers used by the TDCALL instruction, please check the
TDX GHCI specification, sections 2.4 and 3.

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Originally-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix-v1:
 * No code changes (Adding Tony to "To" list)

Changes since RFC v2: 
 * Renamed __tdcall()/__tdvmcall() to __tdx_module_call()/__tdx_hypercall().
 * Renamed reg offsets from TDCALL_rx to TDX_MODULE_rx.
 * Renamed reg offsets from TDVMCALL_rx to TDX_HYPERCALL_rx.
 * Renamed struct tdcall_output to struct tdx_module_output.
 * Renamed struct tdvmcall_output to struct tdx_hypercall_output.
 * Used BIT() to derive TDVMCALL_EXPOSE_REGS_MASK.
 * Removed unnecessary push/pop sequence in __tdcall() function.
 * Fixed comments as per Dave's review.

 arch/x86/include/asm/tdx.h    |  38 ++++++
 arch/x86/kernel/Makefile      |   2 +-
 arch/x86/kernel/asm-offsets.c |  22 ++++
 arch/x86/kernel/tdcall.S      | 223 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c         |  38 ++++++
 5 files changed, 322 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..fcd42119a287 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,50 @@
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
+#include <linux/types.h>
+
+/*
+ * Used in __tdx_module_call() helper function to gather the
+ * output registers' values of TDCALL instruction when requesting
+ * services from the TDX module. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+/*
+ * Used in __tdx_hypercall() helper function to gather the
+ * output registers' values of TDCALL instruction when requesting
+ * services from the VMM. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_hypercall_output {
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
 void __init tdx_early_init(void);
 
+/* Helper function used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		      struct tdx_module_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+		    struct tdx_hypercall_output *out);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..70cafbae4fea 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
 #include <xen/interface/xen.h>
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
 #ifdef CONFIG_X86_32
 # include "asm-offsets_32.c"
 #else
@@ -75,6 +79,24 @@ static void __used common(void)
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+	BLANK();
+	/* Offset for fields in tdx_module_output */
+	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
+	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
+	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+	/* Offset for fields in tdx_hypercall_output */
+	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
+	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
+	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
+	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
+	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#endif
+
 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
 	OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..b06e8b62dfe2
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,223 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+
+#define TDG_R10		BIT(10)
+#define TDG_R11		BIT(11)
+#define TDG_R12		BIT(12)
+#define TDG_R13		BIT(13)
+#define TDG_R14		BIT(14)
+#define TDG_R15		BIT(15)
+
+/*
+ * Bitmask of guest registers (R10-R15) exposed to the VMM. It is
+ * passed to the TDX module via the RCX register, and the TDX module
+ * uses it to identify the list of registers exposed to the VMM.
+ * Each bit in this mask represents a register ID. You can find the
+ * bit-field details in the TDX GHCI specification.
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK	( TDG_R10 | TDG_R11 | \
+					  TDG_R12 | TDG_R13 | \
+					  TDG_R14 | TDG_R15 )
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdx_module_call()  - Helper function used by TDX guests to request
+ * services from the TDX module (does not include VMM services).
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with the
+ * TDX module. If the "tdcall" operation is successful and a valid
+ * "struct tdx_module_output" pointer is available (in "out" argument),
+ * output from the TDX module is saved to the memory specified in the
+ * "out" pointer. Also the status of the "tdcall" operation is returned
+ * back to the user as a function return value.
+ *
+ * @fn  (RDI)		- TDCALL Leaf ID,    moved to RAX
+ * @rcx (RSI)		- Input parameter 1, moved to RCX
+ * @rdx (RDX)		- Input parameter 2, moved to RDX
+ * @r8  (RCX)		- Input parameter 3, moved to R8
+ * @r9  (R8)		- Input parameter 4, moved to R9
+ *
+ * @out (R9)		- struct tdx_module_output pointer
+ *			  stored temporarily in R12 (not
+ * 			  shared with the TDX module)
+ *
+ * Return status of tdcall via RAX.
+ *
+ * NOTE: This function should not be used for TDX hypercall
+ *       use cases.
+ */
+SYM_FUNC_START(__tdx_module_call)
+	FRAME_BEGIN
+
+	/*
+	 * R12 will be used as temporary storage for
+	 * struct tdx_module_output pointer. You can
+	 * find struct tdx_module_output details in
+	 * arch/x86/include/asm/tdx.h. Also note that
+	 * registers R12-R15 are not used by TDCALL
+	 * services supported by this helper function.
+	 */
+	push %r12	/* Callee saved, so preserve it */
+	mov %r9,  %r12 	/* Move output pointer to R12 */
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	mov %rdi, %rax	/* Move TDCALL Leaf ID to RAX */
+	mov %r8,  %r9	/* Move input 4 to R9 */
+	mov %rcx, %r8	/* Move input 3 to R8 */
+	mov %rsi, %rcx	/* Move input 1 to RCX */
+	/* Leave input param 2 in RDX */
+
+	tdcall
+
+	/* Check for TDCALL success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check if caller provided an output struct */
+	test %r12, %r12
+	jz 1f
+
+	/* Copy TDCALL result registers to output struct: */
+	movq %rcx, TDX_MODULE_rcx(%r12)
+	movq %rdx, TDX_MODULE_rdx(%r12)
+	movq %r8,  TDX_MODULE_r8(%r12)
+	movq %r9,  TDX_MODULE_r9(%r12)
+	movq %r10, TDX_MODULE_r10(%r12)
+	movq %r11, TDX_MODULE_r11(%r12)
+1:
+	pop %r12 /* Restore the state of R12 register */
+
+	FRAME_END
+	ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * do_tdx_hypercall()  - Helper function used by TDX guests to request
+ * services from the VMM. All requests are made via the TDX module
+ * using "TDCALL" instruction.
+ *
+ * This function is created to contain common code between vendor
+ * specific and standard type TDX hypercalls. So the caller of this
+ * function has to set the TDVMCALL type in the R10 register before
+ * calling it.
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with VMM
+ * via the TDX module. If the "tdcall" operation is successful and a
+ * valid "struct tdx_hypercall_output" pointer is available (in "out"
+ * argument), output from the VMM is saved to the memory specified in the
+ * "out" pointer. 
+ *
+ * @fn  (RDI)		- TDVMCALL function, moved to R11
+ * @r12 (RSI)		- Input parameter 1, moved to R12
+ * @r13 (RDX)		- Input parameter 2, moved to R13
+ * @r14 (RCX)		- Input parameter 3, moved to R14
+ * @r15 (R8)		- Input parameter 4, moved to R15
+ *
+ * @out (R9)		- struct tdx_hypercall_output pointer
+ *
+ * On successful completion, return TDX hypercall error code.
+ *
+ */
+SYM_FUNC_START_LOCAL(do_tdx_hypercall)
+	/* Save non-volatile GPRs that are exposed to the VMM. */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/* Leave hypercall output pointer in R9, it's not clobbered by VMM */
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	xor %eax, %eax /* Move TDCALL leaf ID (TDVMCALL (0)) to RAX */
+	mov %rdi, %r11 /* Move TDVMCALL function id to R11 */
+	mov %rsi, %r12 /* Move input 1 to R12 */
+	mov %rdx, %r13 /* Move input 2 to R13 */
+	mov %rcx, %r14 /* Move input 3 to R14 */
+	mov %r8,  %r15 /* Move input 4 to R15 */
+	/* Caller of do_tdx_hypercall() will set TDVMCALL type in R10 */
+
+	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+	tdcall
+
+	/*
+	 * Non-zero RAX values indicate a failure of TDCALL itself.
+	 * Panic for those.  This value is unrelated to the hypercall
+	 * result in R10.
+	 */
+	test %rax, %rax
+	jnz 2f
+
+	/* Move hypercall error code to RAX to return to user */
+	mov %r10, %rax
+
+	/* Check for hypercall success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check if caller provided an output struct */
+	test %r9, %r9
+	jz 1f
+
+	/* Copy hypercall result registers to output struct: */
+	movq %r11, TDX_HYPERCALL_r11(%r9)
+	movq %r12, TDX_HYPERCALL_r12(%r9)
+	movq %r13, TDX_HYPERCALL_r13(%r9)
+	movq %r14, TDX_HYPERCALL_r14(%r9)
+	movq %r15, TDX_HYPERCALL_r15(%r9)
+1:
+	/*
+	 * Zero out registers exposed to the VMM to avoid
+	 * speculative execution with VMM-controlled values.
+	 * This needs to include all registers present in
+	 * TDVMCALL_EXPOSE_REGS_MASK.
+	 */
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+	xor %r12d, %r12d
+	xor %r13d, %r13d
+	xor %r14d, %r14d
+	xor %r15d, %r15d
+
+	/* Restore non-volatile GPRs that are exposed to the VMM. */
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	ret
+2:
+	ud2
+SYM_FUNC_END(do_tdx_hypercall)
+
+/*
+ * Helper function for the standard type of TDVMCALLs. This assembly
+ * wrapper reuses do_tdx_hypercall() for standard hypercalls
+ * (R10 is set to zero).
+ */
+SYM_FUNC_START(__tdx_hypercall)
+	FRAME_BEGIN
+	/*
+	 * R10 is not part of the function call ABI, but it is a part
+	 * of the TDVMCALL ABI. So set it to 0 for a standard TDVMCALL
+	 * before calling do_tdx_hypercall().
+	 */
+	xor %r10, %r10
+	call do_tdx_hypercall
+	FRAME_END
+	retq
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 5e70617e9877..97b54317f799 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -1,8 +1,46 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (C) 2020 Intel Corporation */
 
+#define pr_fmt(fmt) "TDX: " fmt
+
 #include <asm/tdx.h>
 
+/*
+ * Wrapper for simple hypercalls that only return a success/error code.
+ */
+static inline u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	u64 err;
+
+	err = __tdx_hypercall(fn, r12, r13, r14, r15, NULL);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return err;
+}
+
+/*
+ * Wrapper for the semi-common case where the caller needs a single
+ * output value (R11). Callers of this function do not care about the
+ * hypercall error code (mainly for the IN or MMIO use cases).
+ */
+static inline u64 tdx_hypercall_out_r11(u64 fn, u64 r12, u64 r13,
+					u64 r14, u64 r15)
+{
+	struct tdx_hypercall_output out = {0};
+	u64 err;
+
+	err = __tdx_hypercall(fn, r12, r13, r14, r15, &out);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return out.r11;
+}
+
 static inline bool cpuid_has_tdx_guest(void)
 {
 	u32 eax, signature[3];
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v1 0/3] x86/tdx: Handle port I/O
  2021-05-12  6:17       ` Dan Williams
@ 2021-05-27  4:23         ` Kuppuswamy Sathyanarayanan
  2021-05-27  4:23           ` [RFC v2-fix-v1 1/3] tdx: Introduce generic protected_guest abstraction Kuppuswamy Sathyanarayanan
                             ` (2 more replies)
  0 siblings, 3 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-27  4:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Kuppuswamy Sathyanarayanan,
	linux-kernel

This patchset addresses the review comments on the patch titled
"[RFC v2 14/32] x86/tdx: Handle port I/O". Since it requires
splitting that patch, the resulting patches are sent together.

Changes since RFC v2:
 * Removed assembly implementation of port IO emulation code
   and modified __in/__out IO helpers to directly call C function
   for in/out instruction emulation in decompression code.
 * Added helper function tdx_get_iosize() to make it easier for
   calling tdg_out/tdg_int() C functions from decompression code.
 * Added support for early exception handler to support IO
   instruction emulation in early boot kernel code.
 * Removed alternative_ usage and made kernel only use #VE based
   IO instruction emulation support outside the decompression module.
 * Added support for protected_guest_has() API to generalize
   AMD SEV/TDX specific initialization code in common drivers.
 * Fixed commit log and comments as per review comments.


Andi Kleen (1):
  x86/tdx: Handle early IO operations

Kirill A. Shutemov (1):
  x86/tdx: Handle port I/O

Kuppuswamy Sathyanarayanan (1):
  tdx: Introduce generic protected_guest abstraction

 arch/Kconfig                           |   3 +
 arch/x86/Kconfig                       |   1 +
 arch/x86/boot/compressed/Makefile      |   1 +
 arch/x86/boot/compressed/tdcall.S      |   3 +
 arch/x86/boot/compressed/tdx.c         |  28 ++++++
 arch/x86/include/asm/io.h              |   7 +-
 arch/x86/include/asm/protected_guest.h |  24 +++++
 arch/x86/include/asm/tdx.h             |  60 ++++++++++++-
 arch/x86/kernel/head64.c               |   4 +
 arch/x86/kernel/tdx.c                  | 116 +++++++++++++++++++++++++
 include/linux/protected_guest.h        |  23 +++++
 11 files changed, 267 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdcall.S
 create mode 100644 arch/x86/include/asm/protected_guest.h
 create mode 100644 include/linux/protected_guest.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v1 1/3] tdx: Introduce generic protected_guest abstraction
  2021-05-27  4:23         ` [RFC v2-fix-v1 0/3] " Kuppuswamy Sathyanarayanan
@ 2021-05-27  4:23           ` Kuppuswamy Sathyanarayanan
  2021-06-01 21:14             ` [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction Kuppuswamy Sathyanarayanan
  2021-05-27  4:23           ` [RFC v2-fix-v1 2/3] x86/tdx: Handle early IO operations Kuppuswamy Sathyanarayanan
  2021-05-27  4:23           ` [RFC v2-fix-v1 3/3] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
  2 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-27  4:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Kuppuswamy Sathyanarayanan,
	linux-kernel

Add a generic way to check if we run with an encrypted guest,
without requiring x86-specific ifdefs. This can then be used in
non-architecture-specific code.

The is_protected_guest() helper function can be implemented using
arch-specific CPU feature flags.

protected_guest_has() is used to check for protected guest
feature flags.

Originally-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/Kconfig                           |  3 +++
 arch/x86/Kconfig                       |  1 +
 arch/x86/include/asm/protected_guest.h | 24 ++++++++++++++++++++++++
 arch/x86/include/asm/tdx.h             |  7 +++++++
 arch/x86/kernel/tdx.c                  | 18 ++++++++++++++++++
 include/linux/protected_guest.h        | 23 +++++++++++++++++++++++
 6 files changed, 76 insertions(+)
 create mode 100644 arch/x86/include/asm/protected_guest.h
 create mode 100644 include/linux/protected_guest.h

diff --git a/arch/Kconfig b/arch/Kconfig
index ecfd3520b676..98c30312555b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -956,6 +956,9 @@ config HAVE_ARCH_NVRAM_OPS
 config ISA_BUS_API
 	def_bool ISA
 
+config ARCH_HAS_PROTECTED_GUEST
+	bool
+
 #
 # ABI hall of shame
 #
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 15e66a99dd41..fc588a64d1a0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -878,6 +878,7 @@ config INTEL_TDX_GUEST
 	select PARAVIRT_XL
 	select X86_X2APIC
 	select SECURITY_LOCKDOWN_LSM
+	select ARCH_HAS_PROTECTED_GUEST
 	help
 	  Provide support for running in a trusted domain on Intel processors
 	  equipped with Trusted Domain eXtenstions. TDX is a new Intel
diff --git a/arch/x86/include/asm/protected_guest.h b/arch/x86/include/asm/protected_guest.h
new file mode 100644
index 000000000000..b2838e58ce94
--- /dev/null
+++ b/arch/x86/include/asm/protected_guest.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_PROTECTED_GUEST
+#define _ASM_PROTECTED_GUEST 1
+
+#include <asm/cpufeature.h>
+#include <asm/tdx.h>
+
+/* Only include through linux/protected_guest.h */
+
+static inline bool is_protected_guest(void)
+{
+	return boot_cpu_has(X86_FEATURE_TDX_GUEST);
+}
+
+static inline bool protected_guest_has(unsigned long flag)
+{
+	if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
+		return tdx_protected_guest_has(flag);
+
+	return false;
+}
+
+#endif
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 597a3e1663d7..53f844200909 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -71,6 +71,8 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
 		    struct tdx_hypercall_output *out);
 
+bool tdx_protected_guest_has(unsigned long flag);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
@@ -80,6 +82,11 @@ static inline bool is_tdx_guest(void)
 
 static inline void tdx_early_init(void) { };
 
+static inline bool tdx_protected_guest_has(unsigned long flag)
+{
+	return false;
+}
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 17725646eb30..858e7f3d8f36 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,7 @@
 #include <asm/vmx.h>
 
 #include <linux/cpu.h>
+#include <linux/protected_guest.h>
 
 /* TDX Module call Leaf IDs */
 #define TDINFO				1
@@ -75,6 +76,23 @@ bool is_tdx_guest(void)
 }
 EXPORT_SYMBOL_GPL(is_tdx_guest);
 
+bool tdx_protected_guest_has(unsigned long flag)
+{
+	if (!is_tdx_guest())
+		return false;
+
+	switch (flag) {
+	case VM_MEM_ENCRYPT:
+	case VM_MEM_ENCRYPT_ACTIVE:
+	case VM_UNROLL_STRING_IO:
+	case VM_HOST_MEM_ENCRYPT:
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(tdx_protected_guest_has);
+
 static void tdg_get_info(void)
 {
 	u64 ret;
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
new file mode 100644
index 000000000000..f362eea39bd8
--- /dev/null
+++ b/include/linux/protected_guest.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_PROTECTED_GUEST_H
+#define _LINUX_PROTECTED_GUEST_H 1
+
+/* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */
+
+/* Support for guest encryption */
+#define VM_MEM_ENCRYPT			0x100
+/* Encryption support is active */
+#define VM_MEM_ENCRYPT_ACTIVE		0x101
+/* Support for unrolled string IO */
+#define VM_UNROLL_STRING_IO		0x102
+/* Support for host memory encryption */
+#define VM_HOST_MEM_ENCRYPT		0x103
+
+#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
+#include <asm/protected_guest.h>
+#else
+static inline bool is_protected_guest(void) { return false; }
+static inline bool protected_guest_has(unsigned long flag) { return false; }
+#endif
+
+#endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v1 2/3] x86/tdx: Handle early IO operations
  2021-05-27  4:23         ` [RFC v2-fix-v1 0/3] " Kuppuswamy Sathyanarayanan
  2021-05-27  4:23           ` [RFC v2-fix-v1 1/3] tdx: Introduce generic protected_guest abstraction Kuppuswamy Sathyanarayanan
@ 2021-05-27  4:23           ` Kuppuswamy Sathyanarayanan
  2021-06-05  4:26             ` Williams, Dan J
  2021-05-27  4:23           ` [RFC v2-fix-v1 3/3] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
  2 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-27  4:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Kuppuswamy Sathyanarayanan,
	linux-kernel

From: Andi Kleen <ak@linux.intel.com>

Add an early #VE handler to convert early port IOs into TDCALLs.

TDX cannot do port IO directly. The TDX module triggers a #VE
exception to let the guest kernel emulate operations like port
IO by converting them into TDCALLs to the host.

Fully featured #VE handler support for port IO will be added
later in this patch set (in the patch titled "x86/tdx: Handle
port I/O"). But it can be used only at a later point in the
boot process. So, to support port IO in early boot code, add
minimal support in the early exception handler. This is
similar to what AMD SEV does.

This is mainly to support early_printk's serial driver, as
well as potentially the VGA driver (although it is expected
not to be used).

The early handler only does IO calls and nothing else, and
anything that goes wrong results in a normal early exception
panic.

It cannot share the code paths with the normal #VE handler
because it needs to avoid using trace calls or printk.

This early handler allows us to use the normal in*/out*
macros without patching them for every driver. We don't
expect port IO to be performance critical at all, so an
extra #VE exception is no problem. There are also no concerns
with nesting, since there should be no NMIs this early.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/tdx.h |  6 ++++
 arch/x86/kernel/head64.c   |  4 +++
 arch/x86/kernel/tdx.c      | 59 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 53f844200909..e880a9dd40d3 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -72,6 +72,7 @@ u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
 		    struct tdx_hypercall_output *out);
 
 bool tdx_protected_guest_has(unsigned long flag);
+bool tdg_early_handle_ve(struct pt_regs *regs);
 
 #else // !CONFIG_INTEL_TDX_GUEST
 
@@ -87,6 +88,11 @@ static inline bool tdx_protected_guest_has(unsigned long flag)
 	return false;
 }
 
+static inline bool tdg_early_handle_ve(struct pt_regs *regs)
+{
+	return false;
+}
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 75f2401cb5db..23d1ff4626aa 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -410,6 +410,10 @@ void __init do_early_exception(struct pt_regs *regs, int trapnr)
 	    trapnr == X86_TRAP_VC && handle_vc_boot_ghcb(regs))
 		return;
 
+	if (IS_ENABLED(CONFIG_INTEL_TDX_GUEST) &&
+	    trapnr == X86_TRAP_VE && tdg_early_handle_ve(regs))
+		return;
+
 	early_fixup_exception(regs, trapnr);
 }
 
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 858e7f3d8f36..ca3442b7accf 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -13,6 +13,10 @@
 #define TDINFO				1
 #define TDGETVEINFO			3
 
+#define VE_GET_IO_TYPE(exit_qual)      (((exit_qual) & 8) ? 0 : 1)
+#define VE_GET_IO_SIZE(exit_qual)      (((exit_qual) & 7) + 1)
+#define VE_GET_PORT_NUM(exit_qual)     ((exit_qual) >> 16)
+
 static struct {
 	unsigned int gpa_width;
 	unsigned long attributes;
@@ -256,6 +260,61 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	return ret;
 }
 
+/*
+ * Handle early port I/O, mainly for the early_printk serial output.
+ * This avoids anything that doesn't work early on, like tracing
+ * or printks, by calling the low-level functions directly. Any
+ * problems are handled by falling back to a standard early exception.
+ *
+ * Assumes the I/O instruction was using ax, which is enforced
+ * by the standard io.h macros.
+ */
+static __init bool tdx_early_io(struct ve_info *ve, struct pt_regs *regs)
+{
+	struct tdx_hypercall_output outh;
+	int out = VE_GET_IO_TYPE(ve->exit_qual);
+	int size = VE_GET_IO_SIZE(ve->exit_qual);
+	int port = VE_GET_PORT_NUM(ve->exit_qual);
+	int ret;
+
+	if (out) {
+		ret = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION,
+				      size, 1, port,
+				      regs->ax,
+				      &outh);
+	} else {
+		u64 mask = GENMASK(8 * size, 0);
+
+		ret = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION,
+				      size, 0, port,
+				      regs->ax, &outh);
+		if (!ret) {
+			regs->ax &= ~mask;
+			regs->ax |= outh.r11 & mask;
+		}
+	}
+
+	return !ret;
+}
+
+/*
+ * Early #VE exception handler. Only used to handle port I/O
+ * for early_printk. If anything goes wrong, handle it like
+ * a normal early exception.
+ */
+__init bool tdg_early_handle_ve(struct pt_regs *regs)
+{
+	struct ve_info ve;
+
+	if (tdg_get_ve_info(&ve))
+		return false;
+
+	if (ve.exit_reason == EXIT_REASON_IO_INSTRUCTION)
+		return tdx_early_io(&ve, regs);
+
+	return false;
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v1 3/3] x86/tdx: Handle port I/O
  2021-05-27  4:23         ` [RFC v2-fix-v1 0/3] " Kuppuswamy Sathyanarayanan
  2021-05-27  4:23           ` [RFC v2-fix-v1 1/3] tdx: Introduce generic protected_guest abstraction Kuppuswamy Sathyanarayanan
  2021-05-27  4:23           ` [RFC v2-fix-v1 2/3] x86/tdx: Handle early IO operations Kuppuswamy Sathyanarayanan
@ 2021-05-27  4:23           ` Kuppuswamy Sathyanarayanan
  2021-06-05 18:52             ` Dan Williams
  2 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-27  4:23 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Kuppuswamy Sathyanarayanan,
	linux-kernel

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

TDX hypervisors cannot emulate instructions directly. This
includes port I/O, which is normally emulated in the hypervisor.
All port I/O instructions inside a TDX guest trigger the #VE
exception, and the I/O would normally be emulated from there.

For the really early code in the decompressor, #VE cannot be
used because the IDT needed for handling the exception is not
set up, and some other infrastructure needed by the handler
is missing. So, to support port I/O in the decompressor code,
add support for paravirt-based I/O port virtualization.

String I/O is also not supported in a TDX guest, so unroll
string I/O operations into a loop operating on one element
at a time. This method is similar to AMD SEV, so simply extend
that existing support to the TDX guest platform.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile |  1 +
 arch/x86/boot/compressed/tdcall.S |  3 ++
 arch/x86/boot/compressed/tdx.c    | 28 ++++++++++++++++++
 arch/x86/include/asm/io.h         |  7 +++--
 arch/x86/include/asm/tdx.h        | 47 ++++++++++++++++++++++++++++++-
 arch/x86/kernel/tdx.c             | 39 +++++++++++++++++++++++++
 6 files changed, 122 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdcall.S

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index a2554621cefe..a944a2038797 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -97,6 +97,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
new file mode 100644
index 000000000000..aafadc136c88
--- /dev/null
+++ b/arch/x86/boot/compressed/tdcall.S
@@ -0,0 +1,3 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include "../../kernel/tdcall.S"
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index 0a87c1775b67..cb20962c7da6 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -4,6 +4,8 @@
  */
 
 #include <asm/tdx.h>
+#include <asm/vmx.h>
+#include <vdso/limits.h>
 
 static int __ro_after_init tdx_guest = -1;
 
@@ -30,3 +32,29 @@ bool is_tdx_guest(void)
 	return !!tdx_guest;
 }
 
+/*
+ * Helper function used for making the hypercall for the
+ * "out" instruction. It is called from the __out I/O
+ * macro (in tdx.h).
+ */
+void tdg_out(int size, int port, unsigned int value)
+{
+	__tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 1,
+			port, value, NULL);
+}
+
+/*
+ * Helper function used for making the hypercall for the
+ * "in" instruction. It is called from the __in I/O macro
+ * (in tdx.h). If the I/O operation fails, it returns all 1s.
+ */
+unsigned int tdg_in(int size, int port)
+{
+	struct tdx_hypercall_output out = {0};
+	int err;
+
+	err = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 0,
+			      port, 0, &out);
+
+	return err ? UINT_MAX : out.r11;
+}
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index ef7a686a55a9..daa75c8eef5d 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -40,6 +40,7 @@
 
 #include <linux/string.h>
 #include <linux/compiler.h>
+#include <linux/protected_guest.h>
 #include <asm/page.h>
 #include <asm/early_ioremap.h>
 #include <asm/pgtable_types.h>
@@ -309,7 +310,8 @@ static inline unsigned type in##bwl##_p(int port)			\
 									\
 static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 {									\
-	if (sev_key_active()) {						\
+	if (sev_key_active() ||						\
+	    protected_guest_has(VM_UNROLL_STRING_IO)) {			\
 		unsigned type *value = (unsigned type *)addr;		\
 		while (count) {						\
 			out##bwl(*value, port);				\
@@ -325,7 +327,8 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 									\
 static inline void ins##bwl(int port, void *addr, unsigned long count)	\
 {									\
-	if (sev_key_active()) {						\
+	if (sev_key_active() ||						\
+	    protected_guest_has(VM_UNROLL_STRING_IO)) {			\
 		unsigned type *value = (unsigned type *)addr;		\
 		while (count) {						\
 			*value = in##bwl(port);				\
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e880a9dd40d3..6ba2dcea533f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -5,6 +5,8 @@
 
 #define TDX_CPUID_LEAF_ID	0x21
 
+#ifndef __ASSEMBLY__
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
@@ -74,6 +76,48 @@ u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
 bool tdx_protected_guest_has(unsigned long flag);
 bool tdg_early_handle_ve(struct pt_regs *regs);
 
+void tdg_out(int size, int port, unsigned int value);
+unsigned int tdg_in(int size, int port);
+
+/* Helper function for converting {b,w,l} to byte size */
+static inline int tdx_get_iosize(char *str)
+{
+	if (str[0] == 'w')
+		return 2;
+	else if (str[0] == 'l')
+		return 4;
+
+	return 1;
+}
+
+/*
+ * To support I/O port access in the decompressor or early kernel
+ * init code, where the #VE exception handler cannot be used, use
+ * the paravirt model to implement the __in/__out macros, which in
+ * turn are used by the kernel's in{b,w,l}()/out{b,w,l}() helper
+ * macros. The __in/__out macro usage is in arch/x86/include/asm/io.h.
+ */
+#ifdef BOOT_COMPRESSED_MISC_H
+#define __out(bwl, bw)							\
+do {									\
+	if (is_tdx_guest()) {						\
+		tdg_out(tdx_get_iosize(#bwl), port, value);		\
+	} else {							\
+		asm volatile("out" #bwl " %" #bw "0, %w1" : :		\
+				"a"(value), "Nd"(port));		\
+	}								\
+} while (0)
+#define __in(bwl, bw)							\
+do {									\
+	if (is_tdx_guest()) {						\
+		value = tdg_in(tdx_get_iosize(#bwl), port);		\
+	} else {							\
+		asm volatile("in" #bwl " %w1, %" #bw "0" :		\
+				"=a"(value) : "Nd"(port));		\
+	}								\
+} while (0)
+#endif
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
@@ -161,6 +205,7 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
 {
 	return -ENODEV;
 }
-#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
 
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
+#endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ca3442b7accf..4a84487ee8ff 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -202,6 +202,42 @@ static void tdg_handle_cpuid(struct pt_regs *regs)
 	regs->dx = out.r15;
 }
 
+void tdg_out(int size, int port, unsigned int value)
+{
+	tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 1, port, value);
+}
+
+unsigned int tdg_in(int size, int port)
+{
+	struct tdx_hypercall_output out = {0};
+	u64 err;
+
+	err = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 0,
+			      port, 0, &out);
+
+	return err ? UINT_MAX : out.r11;
+}
+
+static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
+{
+	bool string = exit_qual & 16;
+	int out, size, port;
+
+	/* I/O strings ops are unrolled at build time. */
+	BUG_ON(string);
+
+	out = VE_GET_IO_TYPE(exit_qual);
+	size = VE_GET_IO_SIZE(exit_qual);
+	port = VE_GET_PORT_NUM(exit_qual);
+
+	if (out) {
+		tdg_out(size, port, regs->ax);
+	} else {
+		regs->ax &= ~GENMASK(8 * size, 0);
+		regs->ax |= tdg_in(size, port) & GENMASK(8 * size, 0);
+	}
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -248,6 +284,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_CPUID:
 		tdg_handle_cpuid(regs);
 		break;
+	case EXIT_REASON_IO_INSTRUCTION:
+		tdg_handle_io(regs, ve->exit_qual);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v3 1/1] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-26  1:09                                             ` Andi Kleen
@ 2021-05-27  4:38                                               ` Kuppuswamy Sathyanarayanan
  2021-06-05  3:35                                                 ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-27  4:38 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Kuppuswamy Sathyanarayanan,
	linux-kernel

Functionally, only devices outside the CPU (such as DMA devices,
or persistent memory being flushed) can notice the external side
effects of WBINVD's cache flushing for write-back mappings. One
exception here is MKTME, but that is not visible outside the TDX
module and not possible inside a TDX guest.

Currently TDX does not support DMA, because DMA typically needs
uncached access for MMIO, and the current TDX module always sets
the IgnorePAT bit, which prevents that.

Persistent memory is also currently not supported. There are some
other cases that use WBINVD, such as the legacy ACPI sleeps, but
none of these is supported under virtualization and there are
better mechanisms inside a guest anyway; guests are usually not
aware of power management. Another code path that uses WBINVD is
the MTRR driver, but EPT/virtualization always disables MTRRs, so
those are not needed either. All of this implies WBINVD is not
needed with current TDX.

So handle the WBINVD instruction as a nop. Currently, the #VE
exception handler does not include any warning for WBINVD handling
because the ACPI reboot code uses it. This matches KVM's behavior:
KVM only allows WBINVD in a guest when the guest supports VT-d
(i.e. DMA), and simply handles it as a nop if it doesn't.

If TDX ever gets DMA support, or persistent memory support, or
some other devices that can observe flushing side effects, a
hypercall can be added to implement it similar to AMD-SEV. But
current TDX does not need it.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-2:
 * Added more details to commit log and comments to address
   review comments.

Changes since RFC v2:
 * Fixed commit log as per review comments.
 * Removed WARN_ONCE for WBINVD #VE support.

 arch/x86/kernel/tdx.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index da5c9cd08299..775ae090b625 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -455,6 +455,13 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdg_handle_mmio(regs, ve);
 		break;
+	case EXIT_REASON_WBINVD:
+		/*
+		 * Non-coherent DMA, persistent memory, MTRRs and
+		 * legacy ACPI sleeps are not supported in a TDX guest,
+		 * so ignore WBINVD and treat it as a nop.
+		 */
+		break;
 	case EXIT_REASON_MONITOR_INSTRUCTION:
 	case EXIT_REASON_MWAIT_INSTRUCTION:
 		/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-18  1:28             ` Kuppuswamy, Sathyanarayanan
@ 2021-05-27  4:46               ` Kuppuswamy, Sathyanarayanan
  2021-05-27  4:47                 ` [RFC v2-fix-v1 1/1] " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-27  4:46 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov
  Cc: Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 5/17/21 6:28 PM, Kuppuswamy, Sathyanarayanan wrote:
>> Because the code is already separate.  You're actually going to some
>> trouble to move the SEV-specific code and then combine it with the
>> TDX-specific code.
>>
>> Anyway, please just give it a shot.  Should take all of ten minutes.  If
>> it doesn't work out in practice, fine.  You'll have a good paragraph for
>> the changelog.
> 
> After reviewing the code again, I have noticed that we don't really have
> much common code between AMD and TDX. So I don't see any justification for
> creating this common layer. So, I have decided to drop this patch and move
> Intel TDX specific memory encryption init code to patch titled "[RFC v2 30/32]
> x86/tdx: Make DMA pages shared". This model is similar to how AMD-SEV
> does the initialization.
> 
> I have sent the modified patch as reply to patch titled "[RFC v2 30/32]
> x86/tdx: Make DMA pages shared". Please check and let me know your comment

My method of using a separate initialization file for Intel-only code will not
work if we want to support both AMD SEV and TDX guests in the same binary.
So please ignore my previous reply. I will address the issue as per your
original comments and send you an updated patch.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v1 1/1] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-27  4:46               ` Kuppuswamy, Sathyanarayanan
@ 2021-05-27  4:47                 ` Kuppuswamy Sathyanarayanan
  2021-06-01  2:10                   ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-27  4:47 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Kuppuswamy Sathyanarayanan,
	linux-kernel

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Intel TDX doesn't allow the VMM to access guest private memory.
Any memory that is required for communication with the VMM must
be shared explicitly by setting the corresponding bit in the page
table entry. After setting the shared bit, the conversion must be
completed with the MapGPA TDVMCALL, which informs the VMM about
the conversion between private/shared mappings. Shared memory is
similar to unencrypted memory in AMD SME/SEV terminology, but the
underlying process of sharing/un-sharing the memory is different
for the Intel TDX guest platform.

SEV assumes that I/O devices can only do DMA to "decrypted"
physical addresses without the C-bit set.  In order for the CPU
to interact with this memory, the CPU needs a decrypted mapping.
To support this, the AMD SME code forces force_dma_unencrypted()
to return true on platforms with the AMD SEV feature, which makes
the DMA memory allocation API trigger set_memory_decrypted() on
those platforms.

TDX is similar. To communicate with I/O devices, the related pages
need to be marked as shared. As mentioned above, shared memory in
the TDX architecture is similar to decrypted memory in AMD SME/SEV.
So, as with AMD SEV, force_dma_unencrypted() has to be forced to
return true. This support is added by other patches in this series.

So move force_dma_unencrypted() out of AMD specific code and call
AMD specific (amd_force_dma_unencrypted()) initialization function
from it. force_dma_unencrypted() will be modified by later patches
to include Intel TDX guest platform specific initialization.

Also, introduce new config option X86_MEM_ENCRYPT_COMMON that has
to be selected by all x86 memory encryption features. This will be
selected by both AMD SEV and Intel TDX guest config options.

This is preparation for the TDX changes to the DMA code and has
no functional change.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
 * Instead of moving all the contents of force_dma_unencrypted() to
   mem_encrypt_common.c, create sub function for AMD and call it
   from common code.
 * Fixed commit log as per review comments.

 arch/x86/Kconfig                 |  8 ++++++--
 arch/x86/mm/Makefile             |  2 ++
 arch/x86/mm/mem_encrypt.c        |  4 ++--
 arch/x86/mm/mem_encrypt_common.c | 22 ++++++++++++++++++++++
 4 files changed, 32 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fc588a64d1a0..7bc371d8ad7d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1531,14 +1531,18 @@ config X86_CPA_STATISTICS
 	  helps to determine the effectiveness of preserving large and huge
 	  page mappings when mapping protections are changed.
 
+config X86_MEM_ENCRYPT_COMMON
+	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select DYNAMIC_PHYSICAL_MASK
+	def_bool n
+
 config AMD_MEM_ENCRYPT
 	bool "AMD Secure Memory Encryption (SME) support"
 	depends on X86_64 && CPU_SUP_AMD
 	select DMA_COHERENT_POOL
-	select DYNAMIC_PHYSICAL_MASK
 	select ARCH_USE_MEMREMAP_PROT
-	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
 	select INSTRUCTION_DECODER
+	select X86_MEM_ENCRYPT_COMMON
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 
+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON)	+= mem_encrypt_common.o
+
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index ae78cef79980..ae4f3924f98f 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -390,8 +390,8 @@ bool noinstr sev_es_active(void)
 	return sev_status & MSR_AMD64_SEV_ES_ENABLED;
 }
 
-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
+/* Override for DMA direct allocation check - AMD specific initialization */
+bool amd_force_dma_unencrypted(struct device *dev)
 {
 	/*
 	 * For SEV, all DMA must be to unencrypted addresses.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..5ebf04482feb
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Memory Encryption Support Common Code
+ *
+ * Copyright (C) 2021 Intel Corporation
+ *
+ * Author: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
+ */
+
+#include <linux/mem_encrypt.h>
+#include <linux/dma-mapping.h>
+
+bool amd_force_dma_unencrypted(struct device *dev);
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+	if (sev_active() || sme_active())
+		return amd_force_dma_unencrypted(dev);
+
+	return false;
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v1 1/1] x86/tdx: Add helper to do MapGPA hypercall
  2021-05-20 23:14     ` Kuppuswamy, Sathyanarayanan
@ 2021-05-27  4:56       ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-27  4:56 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Kuppuswamy Sathyanarayanan,
	linux-kernel

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The MapGPA hypercall is used by TDX guests to request that the
VMM convert the existing mapping of a given GPA address range
between private and shared.

tdx_hcall_gpa_intent() is the wrapper used for making the MapGPA
hypercall.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
  * Renamed tdg_map_gpa() to tdx_hcall_gpa_intent().
  * Fixed commit log and comments as per review comments.

 arch/x86/include/asm/tdx.h | 17 +++++++++++++++++
 arch/x86/kernel/tdx.c      | 24 ++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index a93528246595..eb9fa5f4d0e3 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -7,6 +7,15 @@
 
 #ifndef __ASSEMBLY__
 
+/*
+ * Page mapping type enum. This is a software construct, and
+ * not part of any hardware or VMM ABI.
+ */
+enum tdx_map_type {
+	TDX_MAP_PRIVATE,
+	TDX_MAP_SHARED,
+};
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
@@ -119,6 +128,8 @@ do {									\
 #endif
 
 extern phys_addr_t tdg_shared_mask(void);
+extern int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
+				enum tdx_map_type map_type);
 
 #else // !CONFIG_INTEL_TDX_GUEST
 
@@ -143,6 +154,12 @@ static inline phys_addr_t tdg_shared_mask(void)
 {
 	return 0;
 }
+
+static inline int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
+				       enum tdx_map_type map_type)
+{
+	return -ENODEV;
+}
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 0c8b10b78f32..a8ebd2d10093 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -16,6 +16,9 @@
 #define TDINFO				1
 #define TDGETVEINFO			3
 
+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA		0x10001
+
 #define VE_GET_IO_TYPE(exit_qual)      (((exit_qual) & 8) ? 0 : 1)
 #define VE_GET_IO_SIZE(exit_qual)      (((exit_qual) & 7) + 1)
 #define VE_GET_PORT_NUM(exit_qual)     ((exit_qual) >> 16)
@@ -122,6 +125,27 @@ static void tdg_get_info(void)
 	physical_mask &= ~tdg_shared_mask();
 }
 
+/*
+ * Inform the VMM of the guest's intent for this physical page:
+ * shared with the VMM or private to the guest.  The VMM is
+ * expected to change its mapping of the page in response.
+ *
+ * Note: shared->private conversions require further guest
+ * action to accept the page.
+ */
+int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
+			 enum tdx_map_type map_type)
+{
+	u64 ret;
+
+	if (map_type == TDX_MAP_SHARED)
+		gpa |= tdg_shared_mask();
+
+	ret = tdx_hypercall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0);
+
+	return ret ? -EIO : 0;
+}
+
 static __cpuidle void tdg_halt(void)
 {
 	u64 ret;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* RE: [RFC v2-fix-v2 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-27  0:29               ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
@ 2021-05-27 15:11                 ` Luck, Tony
  2021-05-27 16:24                   ` Sean Christopherson
  0 siblings, 1 reply; 381+ messages in thread
From: Luck, Tony @ 2021-05-27 15:11 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Hansen, Dave
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Williams, Dan J, Raj, Ashok, Sean Christopherson, linux-kernel

+struct ve_info {
+	u64 exit_reason;
+	u64 exit_qual;
+	u64 gla;
+	u64 gpa;
+	u32 instr_len;
+	u32 instr_info;
+};

I guess that "gla" = Guest Linear Address ... which is a very "Intel" way of
describing what everyone else would call a Guest Virtual Address.

I don't feel strongly about this though. If this has already been hashed
out already then stick with this name.

Otherwise:

Reviewed-by: Tony Luck <tony.luck@intel.com>

-Tony

^ permalink raw reply	[flat|nested] 381+ messages in thread

* RE: [RFC v2-fix-v2 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-05-27  0:30                           ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
@ 2021-05-27 15:25                             ` Luck, Tony
  2021-05-27 15:52                               ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Luck, Tony @ 2021-05-27 15:25 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Hansen, Dave
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Williams, Dan J, Raj, Ashok, Sean Christopherson, linux-kernel

> Guests communicate with VMMs with hypercalls. Historically, these
> are implemented using instructions that are known to cause VMEXITs
> like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
> expose guest state to the host.  This prevents the old hypercall
> mechanisms from working. So to communicate with VMM, TDX
> specification defines a new instruction called "tdcall".

You use all caps TDCALL everywhere else in this commit message.
Looks odd to have quoted lower case here.

> In a TDX based VM, since the VMM is an untrusted entity, an intermediary
> layer (TDX module) exists between host and guest to facilitate the
> secure communication. TDX guests communicate with the TDX module and
> with the VMM using a new instruction: TDCALL.

Seems both repeat what was in the first paragraph, but also fail to
explain how this TDCALL is different from that first TDCALL.

> Implement common helper functions to communicate with the TDX Module
> and VMM (using TDCALL instruction).
>   
> __tdx_hypercall()    - request services from the VMM.
> __tdx_module_call()  - communicate with the TDX Module.

Looking at the code, the hypercall can return an error if TDCALL fails,
but module_call forces a panic with UD2 on error. This difference isn't
explained anywhere.

-Tony

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-05-27 15:25                             ` Luck, Tony
@ 2021-05-27 15:52                               ` Kuppuswamy, Sathyanarayanan
  2021-05-27 16:25                                 ` Luck, Tony
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-27 15:52 UTC (permalink / raw)
  To: Luck, Tony, Peter Zijlstra, Andy Lutomirski, Hansen, Dave
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Williams, Dan J, Raj, Ashok, Sean Christopherson, linux-kernel



On 5/27/21 8:25 AM, Luck, Tony wrote:
>> Guests communicate with VMMs with hypercalls. Historically, these
>> are implemented using instructions that are known to cause VMEXITs
>> like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
>> expose guest state to the host.  This prevents the old hypercall
>> mechanisms from working. So to communicate with VMM, TDX
>> specification defines a new instruction called "tdcall".
> 
> You use all caps TDCALL everywhere else in this commit message.
> Looks odd to have quoted lower case here.

I will use TDCALL uniformly.

> 
>> In a TDX based VM, since VMM is an untrusted entity, a intermediary
>> layer (TDX module) exists between host and guest to facilitate the
>> secure communication. TDX guests communicate with the TDX module and
>> with the VMM using a new instruction: TDCALL.
> 
> Seems both repeat what was in the first paragraph, but also fail to
> explain how this TDCALL is different from that first TDCALL.

Both cases use the TDCALL instruction. The arguments we pass
determine the type of TDCALL (one used to communicate with the
TDX module vs. one used to communicate with the VMM).

I can modify the description to convey the difference between both
cases.

> 
>> Implement common helper functions to communicate with the TDX Module
>> and VMM (using TDCALL instruction).
>>     
>> __tdx_hypercall()    - request services from the VMM.
>> __tdx_module_call()  - communicate with the TDX Module.
> 
> Looking at the code, the hypercall can return an error if TDCALL fails,
> but module_call forces a panic with UD2 on error. This difference isn't
> explained anywhere.

I think you meant that the hypercall will panic while the module call
will not.

In the hypercall case, since we use the same TDCALL instruction, we have
two return values: one for TDCALL failure (at the TDX module level) and
one for the return value from the VMM. So in the hypercall case, we
return the VMM's value to the caller but panic on TDCALL failures. Per
the TDX spec, if everything is in order, TDCALL will never fail in the
hypercall use case. So if we observe a TDCALL failure, it means we are
working with a broken TDX module, and we panic.

> -Tony
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-27 15:11                 ` Luck, Tony
@ 2021-05-27 16:24                   ` Sean Christopherson
  2021-05-27 16:36                     ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-27 16:24 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Hansen, Dave, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Williams, Dan J, Raj, Ashok,
	linux-kernel

On Thu, May 27, 2021, Luck, Tony wrote:
> +struct ve_info {
> +	u64 exit_reason;
> +	u64 exit_qual;
> +	u64 gla;
> +	u64 gpa;
> +	u32 instr_len;
> +	u32 instr_info;
> +};
> 
> I guess that "gla" = Guest Linear Address ... which is a very "Intel" way of
> describing what everyone else would call a Guest Virtual Address.
> 
> I don't feel strongly about this though. If this has already been hashed
> out then stick with this name.

The "real" #VE information area that TDX is usurping is an architectural struct
that defines exit_reason, exit_qual, gla, and gpa, and those fields in turn come
directly from their corresponding VMCS fields with longer versions of the same
names, e.g. ve_info->gla is a reflection of vmcs.GUEST_LINEAR_ADDRESS.

So normally I would agree that the "linear" terminology is obnoxious, but in
this specific case I think it's warranted.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* RE: [RFC v2-fix-v2 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-05-27 15:52                               ` Kuppuswamy, Sathyanarayanan
@ 2021-05-27 16:25                                 ` Luck, Tony
  0 siblings, 0 replies; 381+ messages in thread
From: Luck, Tony @ 2021-05-27 16:25 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Hansen, Dave
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Williams, Dan J, Raj, Ashok, Sean Christopherson, linux-kernel

>> Looking at the code, the hypercall can return an error if TDCALL fails,
>> but module_call forces a panic with UD2 on error. This difference isn't
>> explained anywhere.
>
> I think you meant the hypercall will panic vs. the module call will not.

yes

> In the hypercall case, since we use the same TDCALL instruction, we have two
> return values: one for TDCALL failure (at the TDX module level) and the
> other for the return value from the VMM. So in the hypercall case, we return
> the VMM value to the caller but panic on TDCALL failures. As per the TDX
> spec, for the hypercall use case, TDCALL will never fail if everything is in
> order. If we notice a TDCALL failure, it means we are working with a broken
> TDX module. So we panic.

Add a comment in the .S file right before that ud2 explaining this. That
should help anyone tracking down that panic understand that the problem
is in the TDX module.

Otherwise looks ok.

Reviewed-by: Tony Luck <tony.luck@intel.com>

-Tony

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-27 16:24                   ` Sean Christopherson
@ 2021-05-27 16:36                     ` Dave Hansen
  0 siblings, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-27 16:36 UTC (permalink / raw)
  To: Sean Christopherson, Luck, Tony
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Williams, Dan J, Raj, Ashok, linux-kernel

On 5/27/21 9:24 AM, Sean Christopherson wrote:
> On Thu, May 27, 2021, Luck, Tony wrote:
>> +struct ve_info {
>> +	u64 exit_reason;
>> +	u64 exit_qual;
>> +	u64 gla;
>> +	u64 gpa;
>> +	u32 instr_len;
>> +	u32 instr_info;
>> +};
>>
>> I guess that "gla" = Guest Linear Address ... which is a very "Intel" way of
>> describing what everyone else would call a Guest Virtual Address.
>>
>> I don't feel strongly about this though. If this has already been hashed
>> out then stick with this name.
> The "real" #VE information area that TDX is usurping is an architectural struct
> that defines exit_reason, exit_qual, gla, and gpa, and those fields in turn come
> directly from their corresponding VMCS fields with longer versions of the same
> names, e.g. ve_info->gla is a reflection of vmcs.GUEST_LINEAR_ADDRESS.
> 
> So normally I would agree that the "linear" terminology is obnoxious, but in
> this specific case I think it's warranted.

The architectural name needs to be *somewhere*.  But, we do diverge from
the naming in plenty of places.  The architectural name "XSTATE_BV" is
called xstate.xfeatures in the FPU code, for instance.

In this case, the _least_ we can do is:

	u64 gla; /* Guest Linear (virtual) Address */

although I also wouldn't mind if we did something like:

	u64 guest_vaddr; /* Guest Linear Address (gla) in the spec */

either.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v4 1/1] x86/boot: Avoid #VE during boot for TDX platforms
  2021-05-24 23:27                   ` [RFC v2-fix-v3 " Kuppuswamy Sathyanarayanan
@ 2021-05-27 21:25                     ` Kuppuswamy Sathyanarayanan
  2021-06-08 23:14                       ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-27 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Kuppuswamy Sathyanarayanan,
	linux-kernel

From: Sean Christopherson <sean.j.christopherson@intel.com>

There are a few MSRs and control register bits which the kernel
normally needs to modify during boot. But, TDX disallows
modification of these registers to help provide consistent
security guarantees. Fortunately, TDX ensures that these are all
in the correct state before the kernel loads, which means the
kernel has no need to modify them.

The conditions to avoid are:

  * Any writes to the EFER MSR
  * Clearing CR0.NE
  * Clearing CR4.MCE

This theoretically makes guest boot more fragile. If, for
instance, EFER was set up incorrectly and a WRMSR was performed,
it will trigger an early exception panic, or a triple fault if it
happens before early exception handling is set up. However, this
is likely to trip up the guest BIOS long before control reaches
the kernel. In any case, these kinds of problems are unlikely to
occur in production environments, and developers have good debug
tools to fix them quickly.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
Changes since RFC v2-fix-v3:
 * Removed unnecessary content from commit log. No code changes.

Changes since RFC v2-fix-v2:
 * Fixed commit log as per review comments.

Changes since RFC v2-fix:
 * Fixed commit and comments as per Dave and Dan's suggestions.
 * Merged CR0.NE related change in pa_trampoline_compat() from patch
   titled "x86/boot: Add a trampoline for APs booting in 64-bit mode"
   to this patch. It belongs in this patch.
 * Merged TRAMPOLINE_32BIT_CODE_SIZE related change from patch titled
   "x86/boot: Add a trampoline for APs booting in 64-bit mode" to this
   patch (since it was wrongly merged to that patch during patch split).

 arch/x86/boot/compressed/head_64.S   | 16 ++++++++++++----
 arch/x86/boot/compressed/pgtable.h   |  2 +-
 arch/x86/kernel/head_64.S            | 20 ++++++++++++++++++--
 arch/x86/realmode/rm/trampoline_64.S | 23 +++++++++++++++++++----
 4 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..f848569e3fb0 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,12 +616,20 @@ SYM_CODE_START(trampoline_32bit_src)
 	movl	$MSR_EFER, %ecx
 	rdmsr
 	btsl	$_EFER_LME, %eax
+	/* Avoid writing EFER if no change was made (for TDX guest) */
+	jc	1f
 	wrmsr
-	popl	%edx
+1:	popl	%edx
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
-	movl	$X86_CR4_PAE, %eax
+	movl	%cr4, %eax
+	/*
+	 * Clear all bits except CR4.MCE, which is preserved.
+	 * Clearing CR4.MCE will #VE in TDX guests.
+	 */
+	andl	$X86_CR4_MCE, %eax
+	orl	$X86_CR4_PAE, %eax
 	testl	%edx, %edx
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
@@ -635,8 +643,8 @@ SYM_CODE_START(trampoline_32bit_src)
 	pushl	$__KERNEL_CS
 	pushl	%eax
 
-	/* Enable paging again */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
+	/* Enable paging again. Avoid clearing X86_CR0_NE for TDX */
+	movl	$(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	lret
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE	0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
 
 #define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..6cf8d126b80a 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,13 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 1:
 
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	movq	%cr4, %rcx
+	/*
+	 * Clear all bits except CR4.MCE, which is preserved.
+	 * Clearing CR4.MCE will #VE in TDX guests.
+	 */
+	andl	$X86_CR4_MCE, %ecx
+	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
@@ -229,13 +235,23 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	/* Setup EFER (Extended Feature Enable Register) */
 	movl	$MSR_EFER, %ecx
 	rdmsr
+	/*
+	 * Preserve current value of EFER for comparison and to skip
+	 * EFER writes if no change was made (for TDX guest)
+	 */
+	movl    %eax, %edx
 	btsl	$_EFER_SCE, %eax	/* Enable System Call */
 	btl	$20,%edi		/* No Execute supported? */
 	jnc     1f
 	btsl	$_EFER_NX, %eax
 	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1:	wrmsr				/* Make changes effective */
 
+	/* Avoid writing EFER if no change was made (for TDX guest) */
+1:	cmpl	%edx, %eax
+	je	1f
+	xor	%edx, %edx
+	wrmsr				/* Make changes effective */
+1:
 	/* Setup cr0 */
 	movl	$CR0_STATE, %eax
 	/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 957bb21ce105..f121f5e29d50 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,27 @@ SYM_CODE_START(startup_32)
 	movl	%eax, %cr3
 
 	# Set up EFER
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	/*
+	 * Skip writing to EFER if the register already has desired
+	 * value (to avoid #VE for the TDX guest).
+	 */
+	cmp	pa_tr_efer, %eax
+	jne	.Lwrite_efer
+	cmp	pa_tr_efer + 4, %edx
+	je	.Ldone_efer
+.Lwrite_efer:
 	movl	pa_tr_efer, %eax
 	movl	pa_tr_efer + 4, %edx
-	movl	$MSR_EFER, %ecx
 	wrmsr
 
-	# Enable paging and in turn activate Long Mode
-	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+.Ldone_efer:
+	/*
+	 * Enable paging and in turn activate Long Mode. Avoid clearing
+	 * X86_CR0_NE for TDX.
+	 */
+	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	/*
@@ -169,7 +183,8 @@ SYM_CODE_START(pa_trampoline_compat)
 	movl	$rm_stack_end, %esp
 	movw	$__KERNEL_DS, %dx
 
-	movl	$X86_CR0_PE, %eax
+	/* Avoid clearing X86_CR0_NE for TDX */
+	movl	$(X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 	ljmpl   $__KERNEL32_CS, $pa_startup_32
 SYM_CODE_END(pa_trampoline_compat)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-25 18:21                         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-31 15:13                           ` Borislav Petkov
  2021-05-31 17:32                             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-05-31 15:13 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Tom Lendacky, Sean Christopherson, Dave Hansen, Andi Kleen,
	Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel, Brijesh Singh

On Tue, May 25, 2021 at 11:21:21AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> Following is the sample implementation. Please let me know your
> comments.

Doesn't look like what I suggested here:

https://lkml.kernel.org/r/YKfPLlulaqwypNkO@zn.tnic

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-31 15:13                           ` Borislav Petkov
@ 2021-05-31 17:32                             ` Kuppuswamy, Sathyanarayanan
  2021-05-31 17:55                               ` Borislav Petkov
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-31 17:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tom Lendacky, Sean Christopherson, Dave Hansen, Andi Kleen,
	Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel, Brijesh Singh



On 5/31/21 8:13 AM, Borislav Petkov wrote:
> On Tue, May 25, 2021 at 11:21:21AM -0700, Kuppuswamy, Sathyanarayanan wrote:
>> Following is the sample implementation. Please let me know your
>> comments.
> 
> Doesn't look like what I suggested here:
> 
> https://lkml.kernel.org/r/YKfPLlulaqwypNkO@zn.tnic

IIUC, the following are your design suggestions:

1. Define generic flags.

I think the following flags are defined as you suggested.

+++ b/include/linux/protected_guest.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_PROTECTED_GUEST_H
+#define _LINUX_PROTECTED_GUEST_H 1
+
+/* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */
+
+/* Support for guest encryption */
+#define VM_MEM_ENCRYPT                 0x100
+/* Encryption support is active */
+#define VM_MEM_ENCRYPT_ACTIVE          0x101
+/* Support for unrolled string IO */
+#define VM_UNROLL_STRING_IO            0x102
+/* Support for host memory encryption */
+#define VM_HOST_MEM_ENCRYPT            0x103

2. Define generic functions and allow calls to arch specific implementations.

For the above requirement, instead of calling arch-specific functions from
include/linux/protected_guest.h, I have directly included the arch-specific
header in linux/protected_guest.h:

+#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
+#include <asm/protected_guest.h>
+#else
+static inline bool is_protected_guest(void) { return false; }
+static inline bool protected_guest_has(unsigned long flag) { return false; }
+#endif

3. Have arch-specific implementations respond to protected_guest_has() calls, right?

I think the above requirement is satisfied by the following implementation.

+++ b/arch/x86/include/asm/protected_guest.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_PROTECTED_GUEST
+#define _ASM_PROTECTED_GUEST 1
+
+#include <asm/cpufeature.h>
+#include <asm/tdx.h>
+
+/* Only include through linux/protected_guest.h */
+
+static inline bool is_protected_guest(void)
+{
+       return boot_cpu_has(X86_FEATURE_TDX_GUEST);
+}
+
+static inline bool protected_guest_has(unsigned long flag)
+{
+       if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
+               return tdx_protected_guest_has(flag);
+
+       return false;
+}
+

Did I misunderstand anything? Please let me know your comments.



> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-31 17:32                             ` Kuppuswamy, Sathyanarayanan
@ 2021-05-31 17:55                               ` Borislav Petkov
  2021-05-31 18:45                                 ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-05-31 17:55 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Tom Lendacky, Sean Christopherson, Dave Hansen, Andi Kleen,
	Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel, Brijesh Singh

On Mon, May 31, 2021 at 10:32:44AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> I think the above requirement is satisfied by the following implementation.

Well, I suggested a single protected_guest_has() function which does:

        if (AMD)
                amd_protected_guest_has(...)
        else if (Intel)
                intel_protected_guest_has(...)
        else
                WARN()

where amd_protected_guest_has() is implemented in arch/x86/kernel/sev.c
and intel_protected_guest_has() is implemented in, as far as I can
follow your paths in the diff, in arch/x86/kernel/tdx.c.

No is_protected_guest() and no ARCH_HAS_PROTECTED_GUEST.

Just the above controlled by CONFIG_INTEL_TDX_GUEST or whatever
the TDX config item is gonna end up being and on the AMD side by
CONFIG_AMD_MEM_ENCRYPT.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-31 17:55                               ` Borislav Petkov
@ 2021-05-31 18:45                                 ` Kuppuswamy, Sathyanarayanan
  2021-05-31 19:14                                   ` Borislav Petkov
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-31 18:45 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tom Lendacky, Sean Christopherson, Dave Hansen, Andi Kleen,
	Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel, Brijesh Singh



On 5/31/21 10:55 AM, Borislav Petkov wrote:
> On Mon, May 31, 2021 at 10:32:44AM -0700, Kuppuswamy, Sathyanarayanan wrote:
>> I think the above requirement is satisfied by the following implementation.
> 
> Well, I suggested a single protected_guest_has() function which does:
> 
>          if (AMD)
>                  amd_protected_guest_has(...)
>          else if (Intel)
>                  intel_protected_guest_has(...)
>          else
>                  WARN()
> 
> where amd_protected_guest_has() is implemented in arch/x86/kernel/sev.c
> and intel_protected_guest_has() is implemented in, as far as I can
> follow your paths in the diff, in arch/x86/kernel/tdx.c.
> 
> No is_protected_guest() 

is_protected_guest() is a helper function added to check the VM guest type
(protected or normal). Andi is going to add some security hardening code to
virtio and some other generic drivers. He wants a helper function to
selectively enable it for all protected guests. Since these are generic
drivers, we need a generic (non arch-specific) helper call. is_protected_guest()
is proposed for this purpose.

We can also use protected_guest_has(VM_VIRTIO_SECURE_FIX) or something
similar for this purpose. Andi, any comments?

> and no ARCH_HAS_PROTECTED_GUEST.

IMHO, it's better to use the above generic config option in the common
header file (linux/protected_guest.h). Any architecture that implements
the protected guest feature can enable it. This helps us hide arch-specific
config options in the arch-specific header file.

This seems to be a cleaner solution than including arch-specific config
options in the common header file (linux/protected_guest.h):

#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
#include <asm/protected_guest.h>
#else
blah
#endif

is better than

#ifdef (AMD)
amd_call()
#endif

#ifdef (INTEL)
intel_call()
#endif

#ifdef (ARM)
arm_call()
#endif


> 
> Just the above controlled by CONFIG_INTEL_TDX_GUEST or whatever
> the TDX config item is gonna end up being and on the AMD side by
> CONFIG_AMD_MEM_ENCRYPT.
> 
> Thx.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-31 18:45                                 ` Kuppuswamy, Sathyanarayanan
@ 2021-05-31 19:14                                   ` Borislav Petkov
  2021-06-01  2:07                                     ` [RFC v2-fix-v1 1/1] " Kuppuswamy Sathyanarayanan
  2021-06-01 21:16                                     ` [RFC v2 28/32] " Kuppuswamy, Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Borislav Petkov @ 2021-05-31 19:14 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Tom Lendacky, Sean Christopherson, Dave Hansen, Andi Kleen,
	Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel, Brijesh Singh

On Mon, May 31, 2021 at 11:45:38AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> We can also use protected_guest_has(VM_VIRTIO_SECURE_FIX) or something
> similar for this purpose. Andi, any comments?

protected_guest_has() is enough for that - no need for two functions.

> IMHO, it's better to use the above generic config option in the common
> header file (linux/protected_guest.h). Any architecture that implements
> the protected guest feature can enable it. This helps us hide arch-specific
> config options in the arch-specific header file.

You define empty function stubs for when the arch config option is not
enabled. Everything else is unnecessary. When another architecture needs
this, that architecture will generalize it, as is usually done.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-05-20 20:56             ` Dave Hansen
@ 2021-05-31 21:46               ` Kirill A. Shutemov
  2021-06-01  2:08                 ` [RFC v2-fix-v1 1/1] x86/tdx: Exclude Shared bit from physical_mask Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Kirill A. Shutemov @ 2021-05-31 21:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Sean Christopherson, Kuppuswamy, Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel

On Thu, May 20, 2021 at 01:56:13PM -0700, Dave Hansen wrote:
> On 5/20/21 1:16 PM, Sean Christopherson wrote:
> > On Thu, May 20, 2021, Kuppuswamy, Sathyanarayanan wrote:
> >> So what is your proposal? "tdx_guest_" / "tdx_host_" ?
> >   1. Abstract things where appropriate, e.g. I'm guessing there is a clever way
> >      to deal with the shared vs. private inversion and avoid tdg_shared_mask
> >      altogether.
> 
> One example here would be to keep a structure like:
> 
> struct protected_mem_config
> {
> 	unsigned long p_set_bits;
> 	unsigned long p_clear_bits;
> }
> 
> Where 'p_set_bits' are the bits that need to be set to establish memory
> protection and 'p_clear_bits' are the bits that need to be cleared.
> physical_mask would clear both of them:
> 
> 	physical_mask &= ~(pmc.p_set_bits | pmc.p_clear_bits);

To me it looks like abstraction for the sake of abstraction: more levels
of indirection without clear benefit. It doesn't add any readability either:
would you know what 'p_set_bits' stands for in two months? I'm not sure.

I would rather keep an explicit check for the protection flavour. It provides
better context for the reader.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v3 1/1] x86/tdx: ioapic: Add shared bit for IOAPIC base address
  2021-05-24 23:29     ` [RFC v2-fix-v2 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-06-01  1:28       ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-01  1:28 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson,
	Kuppuswamy Sathyanarayanan, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

The kernel interacts with each bare-metal IOAPIC with a special
MMIO page. When running under KVM, the guest's IOAPICs are
emulated by KVM.

When running as a TDX guest, the guest needs to mark each IOAPIC
mapping as "shared" with the host.  This ensures that TDX private
protections are not applied to the page, which allows the TDX host
emulation to work.

Earlier patches in this series modified ioremap() so that
ioremap()-created mappings such as virtio will be marked as
shared. However, the IOAPIC code does not use ioremap() and instead
uses the fixmap mechanism.

Introduce a special fixmap helper just for the IOAPIC code.  Ensure
that it marks IOAPIC pages as "shared".  This replaces
set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
allows custom 'prot' values.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix-v2:
 * Replaced is_tdx_guest() call with protected_guest_has() call.
 * Used pgprot_protected_guest() instead of prot_tdg_shared().

Changes since RFC v2:
 * Fixed commit log and comment as per review comment

 arch/x86/kernel/apic/io_apic.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 73ff4dd426a8..9c0dff0d7aa4 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -49,6 +49,7 @@
 #include <linux/slab.h>
 #include <linux/memblock.h>
 #include <linux/msi.h>
+#include <linux/protected_guest.h>
 
 #include <asm/irqdomain.h>
 #include <asm/io.h>
@@ -2675,6 +2676,18 @@ static struct resource * __init ioapic_setup_resources(void)
 	return res;
 }
 
+static void io_apic_set_fixmap_nocache(enum fixed_addresses idx,
+				       phys_addr_t phys)
+{
+	pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+
+	/* Set TDX guest shared bit in pgprot flags */
+	if (protected_guest_has(VM_SHARED_MAPPING_INIT))
+		flags = pgprot_protected_guest(flags);
+
+	__set_fixmap(idx, phys, flags);
+}
+
 void __init io_apic_init_mappings(void)
 {
 	unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
@@ -2707,7 +2720,7 @@ void __init io_apic_init_mappings(void)
 				      __func__, PAGE_SIZE, PAGE_SIZE);
 			ioapic_phys = __pa(ioapic_phys);
 		}
-		set_fixmap_nocache(idx, ioapic_phys);
+		io_apic_set_fixmap_nocache(idx, ioapic_phys);
 		apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
 			__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
 			ioapic_phys);
@@ -2836,7 +2849,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
 	ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
 	ioapics[idx].mp_config.apicaddr = address;
 
-	set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+	io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
 	if (bad_ioapic_register(idx)) {
 		clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
 		return -ENODEV;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v1 1/1] x86/kvm: Use bounce buffers for TD guest
  2021-04-26 18:01 ` [RFC v2 31/32] x86/kvm: Use bounce buffers for TD guest Kuppuswamy Sathyanarayanan
@ 2021-06-01  2:03   ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-01  2:03 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson,
	Kuppuswamy Sathyanarayanan, linux-kernel

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Intel TDX doesn't allow the VMM to directly access guest private
memory. Any memory that is required for communication with the
VMM must be shared explicitly. The same rule applies to any DMA
to and from the TDX guest. All DMA pages have to be marked as
shared pages. A generic way to achieve this without any changes
to device drivers is to use the SWIOTLB framework.

This method of handling is similar to AMD SEV, so extend the same
support to TDX guests as well. Also, since there is some common
code between AMD SEV and TDX guests in mem_encrypt_init(), move it
to mem_encrypt_common.c and call the AMD-specific init function
from there.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
 * Fixed commit log as per review comments.
 * Instead of moving all AMD related changes to mem_encrypt_common.c,
   created a AMD specific helper function amd_mem_encrypt_init() and
   called it from mem_encrypt_init().
 * Removed redundant changes in arch/x86/kernel/pci-swiotlb.c.

 arch/x86/include/asm/mem_encrypt_common.h |  2 ++
 arch/x86/kernel/tdx.c                     |  3 +++
 arch/x86/mm/mem_encrypt.c                 |  5 +----
 arch/x86/mm/mem_encrypt_common.c          | 16 ++++++++++++++++
 4 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt_common.h b/arch/x86/include/asm/mem_encrypt_common.h
index 697bc40a4e3d..48d98a3d64fd 100644
--- a/arch/x86/include/asm/mem_encrypt_common.h
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -8,11 +8,13 @@
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 bool amd_force_dma_unencrypted(struct device *dev);
+void __init amd_mem_encrypt_init(void);
 #else /* CONFIG_AMD_MEM_ENCRYPT */
 static inline bool amd_force_dma_unencrypted(struct device *dev)
 {
 	return false;
 }
+static inline void amd_mem_encrypt_init(void) {}
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 
 #endif
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e84ae4f302b8..31aa47ba8f91 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -8,6 +8,7 @@
 #include <asm/vmx.h>
 #include <asm/insn.h>
 #include <linux/sched/signal.h> /* force_sig_fault() */
+#include <linux/swiotlb.h>
 
 #include <linux/cpu.h>
 #include <linux/protected_guest.h>
@@ -536,6 +537,8 @@ void __init tdx_early_init(void)
 
 	legacy_pic = &null_legacy_pic;
 
+	swiotlb_force = SWIOTLB_FORCE;
+
 	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdg:cpu_hotplug",
 			  NULL, tdg_cpu_offline_prepare);
 
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 5a81f73dd61e..073f2105b4af 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -467,14 +467,11 @@ static void print_mem_encrypt_feature_info(void)
 }
 
 /* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
+void __init amd_mem_encrypt_init(void)
 {
 	if (!sme_me_mask)
 		return;
 
-	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-	swiotlb_update_mem_attributes();
-
 	/*
 	 * With SEV, we need to unroll the rep string I/O instructions,
 	 * but SEV-ES supports them through the #VC handler.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 661c9457c02e..24c9117547b4 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -9,6 +9,7 @@
 
 #include <asm/mem_encrypt_common.h>
 #include <linux/dma-mapping.h>
+#include <linux/swiotlb.h>
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
@@ -21,3 +22,18 @@ bool force_dma_unencrypted(struct device *dev)
 
 	return false;
 }
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+	/*
+	 * For TDX guest or SEV/SME, call into SWIOTLB to update
+	 * the SWIOTLB DMA buffers
+	 */
+	if (sme_me_mask || protected_guest_has(VM_MEM_ENCRYPT))
+		swiotlb_update_mem_attributes();
+
+	if (sme_me_mask)
+		amd_mem_encrypt_init();
+}
+
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 1/1] x86/tdx: Make DMA pages shared
  2021-05-18 22:31         ` Dave Hansen
@ 2021-06-01  2:06           ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-01  2:06 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson,
	Kuppuswamy Sathyanarayanan, linux-kernel

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Just like MKTME, TDX reassigns bits of the physical address for
metadata.  MKTME used several bits for an encryption KeyID. TDX
uses a single bit in guests to communicate whether a physical page
should be protected by TDX as private memory (bit set to 0) or
unprotected and shared with the VMM (bit set to 1).

__set_memory_enc_dec() is now aware of TDX and sets the Shared bit
accordingly, followed by the relevant TDX hypercall.

Also, do TDACCEPTPAGE on every 4k page after mapping the GPA range
when converting memory to private. The 4k page size limit is due to
a restriction in the current TDX spec. If the GPA (range) was
already mapped as an active, private page, the host VMM may remove
the private page from the TD by following the “Removing TD Private
Pages” sequence in the Intel TDX-module specification [1] to safely
block the mapping(s), flush the TLB and cache, and remove the
mapping(s).

BUG() if TDACCEPTPAGE fails (except in the "previously accepted
page" case), as the guest is completely hosed if it can't access
memory.

[1] https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf

Tested-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix:
 * Using a separate file for Intel TDX specific memory initialization
   breaks binary compatibility (between AMD/TDX). So reverted to the
   older version.
 * Fixed the commit log to reflect the above change.
 * Replaced is_tdx_guest() checks with appropriate
   protected_guest_has() checks.
 * Used tdx_hcall_gpa_intent() instead of __tdg_map_gpa() call.
 * Removed __tdg_map_gpa() helper function and added tdg_accept_page()
   related changes to tdx_hcall_gpa_intent().
 * Used pgprot_pg_shared_mask() macro for __pgprot(tdg_shared_mask()).
 * Fixed commit log as per review comments.

Changes since RFC v2:
 * Since the common code between AMD-SEV and TDX is very minimal,
   defining a new config (X86_MEM_ENCRYPT_COMMON) for common code
   is not very useful. So created a separate file for Intel TDX
   specific memory initialization (similar to AMD SEV).
 * Removed patch titled "x86/mm: Move force_dma_unencrypted() to
   common code" from this series. And merged required changes in
   this patch.

 arch/x86/include/asm/pgtable.h   |  1 +
 arch/x86/kernel/tdx.c            | 34 ++++++++++++++++++-----
 arch/x86/mm/mem_encrypt_common.c |  3 +++
 arch/x86/mm/pat/set_memory.c     | 46 +++++++++++++++++++++++++++-----
 4 files changed, 71 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7988e1fc2ce9..87c93815c4d7 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -24,6 +24,7 @@
 /* Make the page accesable by VMM for protected guests */
 #define pgprot_protected_guest(prot) __pgprot(pgprot_val(prot) |	\
 					      tdg_shared_mask())
+#define pgprot_pg_shared_mask() __pgprot(tdg_shared_mask())
 
 #ifndef __ASSEMBLY__
 #include <asm/x86_init.h>
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 07610eab1c64..e84ae4f302b8 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -15,10 +15,14 @@
 /* TDX Module call Leaf IDs */
 #define TDINFO				1
 #define TDGETVEINFO			3
+#define TDACCEPTPAGE			6
 
 /* TDX hypercall Leaf IDs */
 #define TDVMCALL_MAP_GPA		0x10001
 
+/* TDX Module call error codes */
+#define TDX_PAGE_ALREADY_ACCEPTED       0x8000000000000001
+
 #define VE_GET_IO_TYPE(exit_qual)      (((exit_qual) & 8) ? 0 : 1)
 #define VE_GET_IO_SIZE(exit_qual)      (((exit_qual) & 7) + 1)
 #define VE_GET_PORT_NUM(exit_qual)     ((exit_qual) >> 16)
@@ -126,25 +130,43 @@ static void tdg_get_info(void)
 	physical_mask &= ~tdg_shared_mask();
 }
 
+static void tdg_accept_page(phys_addr_t gpa)
+{
+	u64 ret;
+
+	ret = __tdx_module_call(TDACCEPTPAGE, gpa, 0, 0, 0, NULL);
+
+	BUG_ON(ret && ret != TDX_PAGE_ALREADY_ACCEPTED);
+}
+
 /*
  * Inform the VMM of the guest's intent for this physical page:
  * shared with the VMM or private to the guest.  The VMM is
  * expected to change its mapping of the page in response.
- *
- * Note: shared->private conversions require further guest
- * action to accept the page.
  */
 int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
 			 enum tdx_map_type map_type)
 {
-	u64 ret;
+	u64 ret = 0;
+	int i;
 
 	if (map_type == TDX_MAP_SHARED)
 		gpa |= tdg_shared_mask();
 
-	ret = tdx_hypercall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0);
+	if (tdx_hypercall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0))
+		ret = -EIO;
 
-	return ret ? -EIO : 0;
+	if (ret || map_type == TDX_MAP_SHARED)
+		return ret;
+
+	/*
+	 * For shared->private conversion, accept the page using TDACCEPTPAGE
+	 * TDX module call.
+	 */
+	for (i = 0; i < numpages; i++)
+		tdg_accept_page(gpa + i * PAGE_SIZE);
+
+	return 0;
 }
 
 static __cpuidle void tdg_halt(void)
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 4a9a4d5f36cd..661c9457c02e 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -16,5 +16,8 @@ bool force_dma_unencrypted(struct device *dev)
 	if (sev_active() || sme_active())
 		return amd_force_dma_unencrypted(dev);
 
+	if (protected_guest_has(VM_MEM_ENCRYPT))
+		return true;
+
 	return false;
 }
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..56ea2079cc36 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -27,6 +27,7 @@
 #include <asm/proto.h>
 #include <asm/memtype.h>
 #include <asm/set_memory.h>
+#include <asm/tdx.h>
 
 #include "../mm_internal.h"
 
@@ -1972,13 +1973,16 @@ int set_memory_global(unsigned long addr, int numpages)
 				    __pgprot(_PAGE_GLOBAL), 0);
 }
 
-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
 {
+	pgprot_t mem_protected_bits, mem_plain_bits;
 	struct cpa_data cpa;
+	enum tdx_map_type map_type;
 	int ret;
 
 	/* Nothing to do if memory encryption is not active */
-	if (!mem_encrypt_active())
+	if (!mem_encrypt_active() &&
+	    !protected_guest_has(VM_MEM_ENCRYPT_ACTIVE))
 		return 0;
 
 	/* Should not be working on unaligned addresses */
@@ -1988,8 +1992,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	memset(&cpa, 0, sizeof(cpa));
 	cpa.vaddr = &addr;
 	cpa.numpages = numpages;
-	cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
-	cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+	if (protected_guest_has(VM_SHARED_MAPPING_INIT)) {
+		mem_protected_bits = __pgprot(0);
+		mem_plain_bits = pgprot_pg_shared_mask();
+	} else {
+		mem_protected_bits = __pgprot(_PAGE_ENC);
+		mem_plain_bits = __pgprot(0);
+	}
+
+	if (protect) {
+		cpa.mask_set = mem_protected_bits;
+		cpa.mask_clr = mem_plain_bits;
+		map_type = TDX_MAP_PRIVATE;
+	} else {
+		cpa.mask_set = mem_plain_bits;
+		cpa.mask_clr = mem_protected_bits;
+		map_type = TDX_MAP_SHARED;
+	}
+
 	cpa.pgd = init_mm.pgd;
 
 	/* Must avoid aliasing mappings in the highmem code */
@@ -1998,8 +2019,16 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 
 	/*
 	 * Before changing the encryption attribute, we need to flush caches.
+	 *
+	 * For TDX we need to flush caches on private->shared. VMM is
+	 * responsible for flushing on shared->private.
 	 */
-	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	if (is_tdx_guest()) {
+		if (map_type == TDX_MAP_SHARED)
+			cpa_flush(&cpa, 1);
+	} else {
+		cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	}
 
 	ret = __change_page_attr_set_clr(&cpa, 1);
 
@@ -2012,18 +2041,21 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	 */
 	cpa_flush(&cpa, 0);
 
+	if (!ret && protected_guest_has(VM_SHARED_MAPPING_INIT))
+		ret = tdx_hcall_gpa_intent(__pa(addr), numpages, map_type);
+
 	return ret;
 }
 
 int set_memory_encrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, true);
+	return __set_memory_protect(addr, numpages, true);
 }
 EXPORT_SYMBOL_GPL(set_memory_encrypted);
 
 int set_memory_decrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, false);
+	return __set_memory_protect(addr, numpages, false);
 }
 EXPORT_SYMBOL_GPL(set_memory_decrypted);
 
-- 
2.25.1



* [RFC v2-fix-v1 1/1] x86/tdx: Make pages shared in ioremap()
  2021-05-31 19:14                                   ` Borislav Petkov
@ 2021-06-01  2:07                                     ` Kuppuswamy Sathyanarayanan
  2021-06-01 21:16                                     ` [RFC v2 28/32] " Kuppuswamy, Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-01  2:07 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson,
	Kuppuswamy Sathyanarayanan, linux-kernel

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

All ioremap()ed pages that are not backed by normal memory (NONE or
RESERVED) have to be mapped as shared.

Reuse the infrastructure we have for AMD SEV.

Note that the DMA code doesn't use ioremap() to convert memory to
shared, as DMA buffers are backed by normal memory. The DMA code
makes buffers shared with set_memory_decrypted().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
 * Replaced is_tdx_guest() checks with protected_guest_has() calls.
 * Renamed pgprot_tdg_shared() to pgprot_protected_guest()

 arch/x86/include/asm/pgtable.h | 4 ++++
 arch/x86/mm/ioremap.c          | 9 ++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..7988e1fc2ce9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -21,6 +21,10 @@
 #define pgprot_encrypted(prot)	__pgprot(__sme_set(pgprot_val(prot)))
 #define pgprot_decrypted(prot)	__pgprot(__sme_clr(pgprot_val(prot)))
 
+/* Make the page accesable by VMM for protected guests */
+#define pgprot_protected_guest(prot) __pgprot(pgprot_val(prot) |	\
+					      tdg_shared_mask())
+
 #ifndef __ASSEMBLY__
 #include <asm/x86_init.h>
 #include <asm/fpu/xstate.h>
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 9e5ccc56f8e0..f0d31f6fd98c 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -17,6 +17,7 @@
 #include <linux/mem_encrypt.h>
 #include <linux/efi.h>
 #include <linux/pgtable.h>
+#include <linux/protected_guest.h>
 
 #include <asm/set_memory.h>
 #include <asm/e820/api.h>
@@ -87,12 +88,12 @@ static unsigned int __ioremap_check_ram(struct resource *res)
 }
 
 /*
- * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
- * there the whole memory is already encrypted.
+ * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
+ * private in TDX case) because there the whole memory is already encrypted.
  */
 static unsigned int __ioremap_check_encrypted(struct resource *res)
 {
-	if (!sev_active())
+	if (!sev_active() && !protected_guest_has(VM_MEM_ENCRYPT))
 		return 0;
 
 	switch (res->desc) {
@@ -244,6 +245,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
 	prot = PAGE_KERNEL_IO;
 	if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
 		prot = pgprot_encrypted(prot);
+	else if (protected_guest_has(VM_SHARED_MAPPING_INIT))
+		prot = pgprot_protected_guest(prot);
 
 	switch (pcm) {
 	case _PAGE_CACHE_MODE_UC:
-- 
2.25.1



* [RFC v2-fix-v1 1/1] x86/tdx: Exclude Shared bit from physical_mask
  2021-05-31 21:46               ` Kirill A. Shutemov
@ 2021-06-01  2:08                 ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-01  2:08 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson,
	Kuppuswamy Sathyanarayanan, linux-kernel

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Just like MKTME, TDX reassigns bits of the physical address for
metadata.  MKTME used several bits for an encryption KeyID. TDX
uses a single bit in guests to communicate whether a physical page
should be protected by TDX as private memory (bit set to 0) or
unprotected and shared with the VMM (bit set to 1).

Add a helper, tdg_shared_mask() to generate the mask.  The processor
enumerates its physical address width to include the shared bit, which
means it gets included in __PHYSICAL_MASK by default.

Remove the shared mask from 'physical_mask' since any bits in
tdg_shared_mask() are not used for physical addresses in page table
entries.

Also, note that we cannot combine the shared mapping configuration
for AMD SME and Intel TDX guest platforms in a common function. SME
has to do it very early in __startup_64() as it sets the bit on all
memory, except what is used for communication. TDX can postpone it,
as it doesn't need any shared mappings in very early boot.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC-v2:
 * Renamed __PHYSICAL_MASK to physical_mask in commit subject.
 * Fixed commit log as per review comments.

 arch/x86/Kconfig           | 1 +
 arch/x86/include/asm/tdx.h | 6 ++++++
 arch/x86/kernel/tdx.c      | 9 +++++++++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7bc371d8ad7d..7e7ac99c4f4c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -879,6 +879,7 @@ config INTEL_TDX_GUEST
 	select X86_X2APIC
 	select SECURITY_LOCKDOWN_LSM
 	select ARCH_HAS_PROTECTED_GUEST
+	select X86_MEM_ENCRYPT_COMMON
 	help
 	  Provide support for running in a trusted domain on Intel processors
 	  equipped with Trusted Domain eXtenstions. TDX is a new Intel
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index dfdb303ef7e2..0808cbbde045 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -118,6 +118,8 @@ do {									\
 } while (0)
 #endif
 
+extern phys_addr_t tdg_shared_mask(void);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
@@ -137,6 +139,10 @@ static inline bool tdg_early_handle_ve(struct pt_regs *regs)
 	return false;
 }
 
+static inline phys_addr_t tdg_shared_mask(void)
+{
+	return 0;
+}
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 02a3273b09d2..29d4b06535ce 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -101,6 +101,12 @@ bool tdx_protected_guest_has(unsigned long flag)
 }
 EXPORT_SYMBOL_GPL(tdx_protected_guest_has);
 
+/* The highest bit of a guest physical address is the "sharing" bit */
+phys_addr_t tdg_shared_mask(void)
+{
+	return 1ULL << (td_info.gpa_width - 1);
+}
+
 static void tdg_get_info(void)
 {
 	u64 ret;
@@ -112,6 +118,9 @@ static void tdg_get_info(void)
 
 	td_info.gpa_width = out.rcx & GENMASK(5, 0);
 	td_info.attributes = out.rdx;
+
+	/* Exclude Shared bit from the __PHYSICAL_MASK */
+	physical_mask &= ~tdg_shared_mask();
 }
 
 static __cpuidle void tdg_halt(void)
-- 
2.25.1



* [RFC v2-fix-v2 1/1] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-27  4:47                 ` [RFC v2-fix-v1 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-06-01  2:10                   ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-01  2:10 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson,
	Kuppuswamy Sathyanarayanan, linux-kernel

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Intel TDX doesn't allow the VMM to access guest private memory. Any
memory that is required for communication with the VMM must be shared
explicitly by setting the bit in the page table entry. After setting
the shared bit, the conversion must be completed with the MapGPA
TDVMCALL. The call informs the VMM about the conversion between
private/shared mappings. The shared memory is similar to unencrypted
memory in AMD SME/SEV terminology, but the underlying process of
sharing/un-sharing the memory is different for the Intel TDX guest
platform.

SEV assumes that I/O devices can only do DMA to "decrypted"
physical addresses without the C-bit set.  In order for the CPU
to interact with this memory, the CPU needs a decrypted mapping.
To add this support, AMD SME code forces force_dma_unencrypted()
to return true for platforms that support the AMD SEV feature. It
is used by the DMA memory allocation API to trigger
set_memory_decrypted() for platforms that support the AMD SEV
feature.

TDX is similar. So, to communicate with I/O devices, the related
pages need to be marked as shared. As mentioned above, shared memory
in the TDX architecture is similar to decrypted memory in AMD
SME/SEV. So, similar to AMD SEV, force_dma_unencrypted() has to be
forced to return true. This support is added in other patches in
this series.

So move force_dma_unencrypted() out of the AMD-specific code and
call the AMD-specific helper (amd_force_dma_unencrypted()) from it.
force_dma_unencrypted() will be modified by later patches to include
Intel TDX guest platform specific handling.

Also, introduce new config option X86_MEM_ENCRYPT_COMMON that has
to be selected by all x86 memory encryption features. This will be
selected by both AMD SEV and Intel TDX guest config options.

This is preparation for TDX changes in the DMA code and has no
functional change.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
Changes since RFC v2-fix-v1:
 * Added mem_encrypt_common.h and moved common encryption
   related function declarations to it.
Changes since RFC v2:
 * Instead of moving all the contents of force_dma_unencrypted() to
   mem_encrypt_common.c, created a sub-function for AMD and called it
   from common code.
 * Fixed commit log as per review comments.

 arch/x86/Kconfig                          |  8 ++++++--
 arch/x86/include/asm/mem_encrypt_common.h | 18 ++++++++++++++++++
 arch/x86/mm/Makefile                      |  2 ++
 arch/x86/mm/mem_encrypt.c                 |  5 +++--
 arch/x86/mm/mem_encrypt_common.c          | 20 ++++++++++++++++++++
 5 files changed, 49 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/include/asm/mem_encrypt_common.h
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fc588a64d1a0..7bc371d8ad7d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1531,14 +1531,18 @@ config X86_CPA_STATISTICS
 	  helps to determine the effectiveness of preserving large and huge
 	  page mappings when mapping protections are changed.
 
+config X86_MEM_ENCRYPT_COMMON
+	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select DYNAMIC_PHYSICAL_MASK
+	def_bool n
+
 config AMD_MEM_ENCRYPT
 	bool "AMD Secure Memory Encryption (SME) support"
 	depends on X86_64 && CPU_SUP_AMD
 	select DMA_COHERENT_POOL
-	select DYNAMIC_PHYSICAL_MASK
 	select ARCH_USE_MEMREMAP_PROT
-	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
 	select INSTRUCTION_DECODER
+	select X86_MEM_ENCRYPT_COMMON
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/mem_encrypt_common.h b/arch/x86/include/asm/mem_encrypt_common.h
new file mode 100644
index 000000000000..697bc40a4e3d
--- /dev/null
+++ b/arch/x86/include/asm/mem_encrypt_common.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_MEM_ENCRYPT_COMMON_H
+#define _ASM_X86_MEM_ENCRYPT_COMMON_H
+
+#include <linux/mem_encrypt.h>
+#include <linux/device.h>
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+bool amd_force_dma_unencrypted(struct device *dev);
+#else /* CONFIG_AMD_MEM_ENCRYPT */
+static inline bool amd_force_dma_unencrypted(struct device *dev)
+{
+	return false;
+}
+#endif /* CONFIG_AMD_MEM_ENCRYPT */
+
+#endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 
+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON)	+= mem_encrypt_common.o
+
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index ae78cef79980..5a81f73dd61e 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -29,6 +29,7 @@
 #include <asm/processor-flags.h>
 #include <asm/msr.h>
 #include <asm/cmdline.h>
+#include <asm/mem_encrypt_common.h>
 
 #include "mm_internal.h"
 
@@ -390,8 +391,8 @@ bool noinstr sev_es_active(void)
 	return sev_status & MSR_AMD64_SEV_ES_ENABLED;
 }
 
-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
+/* Override for DMA direct allocation check - AMD specific initialization */
+bool amd_force_dma_unencrypted(struct device *dev)
 {
 	/*
 	 * For SEV, all DMA must be to unencrypted addresses.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..4a9a4d5f36cd
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Memory Encryption Support Common Code
+ *
+ * Copyright (C) 2021 Intel Corporation
+ *
+ * Author: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
+ */
+
+#include <asm/mem_encrypt_common.h>
+#include <linux/dma-mapping.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+	if (sev_active() || sme_active())
+		return amd_force_dma_unencrypted(dev);
+
+	return false;
+}
-- 
2.25.1



* [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-05-27  4:23           ` [RFC v2-fix-v1 1/3] tdx: Introduce generic protected_guest abstraction Kuppuswamy Sathyanarayanan
@ 2021-06-01 21:14             ` Kuppuswamy Sathyanarayanan
  2021-06-02 17:20               ` Sean Christopherson
                                 ` (2 more replies)
  0 siblings, 3 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-01 21:14 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Borislav Petkov
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson,
	Kuppuswamy Sathyanarayanan, linux-kernel, Tom Lendacky

Add a generic way to check if we run with an encrypted guest,
without requiring x86-specific ifdefs. This can then be used in
non-architecture-specific code.

protected_guest_has() is used to check for protected guest
feature flags.

Originally-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
Changes since RFC v2-fix-v1:
 * Changed the title from "tdx: Introduce generic protected_guest
   abstraction" to "x86: Introduce generic protected guest"
 * Removed usage of ARCH_HAS_PROTECTED_GUEST and directly called TDX
   and AMD specific xx_protected_guest_has() variants from
   linux/protected_guest.h.
 * Added support for amd_protected_guest_has() helper function.
 * Removed redundant is_tdx_guest() check in tdx_protected_guest_has()
   function.
 * Fixed commit log to reflect the latest changes.

 arch/x86/include/asm/mem_encrypt.h |  4 +++
 arch/x86/include/asm/tdx.h         |  7 ++++++
 arch/x86/kernel/tdx.c              | 16 ++++++++++++
 arch/x86/mm/mem_encrypt.c          | 13 ++++++++++
 include/linux/protected_guest.h    | 40 ++++++++++++++++++++++++++++++
 5 files changed, 80 insertions(+)
 create mode 100644 include/linux/protected_guest.h

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index 9c80c68d75b5..1492b0eb29d0 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -56,6 +56,8 @@ bool sev_es_active(void);
 
 #define __bss_decrypted __section(".bss..decrypted")
 
+bool amd_protected_guest_has(unsigned long flag);
+
 #else	/* !CONFIG_AMD_MEM_ENCRYPT */
 
 #define sme_me_mask	0ULL
@@ -86,6 +88,8 @@ early_set_memory_encrypted(unsigned long vaddr, unsigned long size) { return 0;
 
 static inline void mem_encrypt_free_decrypted_mem(void) { }
 
+static inline bool amd_protected_guest_has(unsigned long flag) { return false; }
+
 #define __bss_decrypted
 
 #endif	/* CONFIG_AMD_MEM_ENCRYPT */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f0c1912837c8..cbfe7479f2a3 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -71,6 +71,8 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
 		    struct tdx_hypercall_output *out);
 
+bool tdx_protected_guest_has(unsigned long flag);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
@@ -80,6 +82,11 @@ static inline bool is_tdx_guest(void)
 
 static inline void tdx_early_init(void) { };
 
+static inline bool tdx_protected_guest_has(unsigned long flag)
+{
+	return false;
+}
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 17725646eb30..b1cdb37a8636 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,7 @@
 #include <asm/vmx.h>
 
 #include <linux/cpu.h>
+#include <linux/protected_guest.h>
 
 /* TDX Module call Leaf IDs */
 #define TDINFO				1
@@ -75,6 +76,21 @@ bool is_tdx_guest(void)
 }
 EXPORT_SYMBOL_GPL(is_tdx_guest);
 
+bool tdx_protected_guest_has(unsigned long flag)
+{
+	switch (flag) {
+	case VM_MEM_ENCRYPT:
+	case VM_MEM_ENCRYPT_ACTIVE:
+	case VM_UNROLL_STRING_IO:
+	case VM_HOST_MEM_ENCRYPT:
+	case VM_SHARED_MAPPING_INIT:
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(tdx_protected_guest_has);
+
 static void tdg_get_info(void)
 {
 	u64 ret;
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index ff08dc463634..7019eab20096 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -20,6 +20,7 @@
 #include <linux/bitops.h>
 #include <linux/dma-mapping.h>
 #include <linux/virtio_config.h>
+#include <linux/protected_guest.h>
 
 #include <asm/tlbflush.h>
 #include <asm/fixmap.h>
@@ -389,6 +390,18 @@ bool noinstr sev_es_active(void)
 	return sev_status & MSR_AMD64_SEV_ES_ENABLED;
 }
 
+bool amd_protected_guest_has(unsigned long flag)
+{
+	switch (flag) {
+	case VM_MEM_ENCRYPT:
+	case VM_MEM_ENCRYPT_ACTIVE:
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(amd_protected_guest_has);
+
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
 {
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
new file mode 100644
index 000000000000..303dfba81d52
--- /dev/null
+++ b/include/linux/protected_guest.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_PROTECTED_GUEST_H
+#define _LINUX_PROTECTED_GUEST_H 1
+
+#include <linux/mem_encrypt.h>
+
+/* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */
+
+/* Support for guest encryption */
+#define VM_MEM_ENCRYPT			0x100
+/* Encryption support is active */
+#define VM_MEM_ENCRYPT_ACTIVE		0x101
+/* Support for unrolled string IO */
+#define VM_UNROLL_STRING_IO		0x102
+/* Support for host memory encryption */
+#define VM_HOST_MEM_ENCRYPT		0x103
+/* Support for shared mapping initialization (after early init) */
+#define VM_SHARED_MAPPING_INIT		0x104
+
+#if defined(CONFIG_INTEL_TDX_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
+
+#include <asm/tdx.h>
+
+static inline bool protected_guest_has(unsigned long flag)
+{
+	if (is_tdx_guest())
+		return tdx_protected_guest_has(flag);
+	else if (mem_encrypt_active())
+		return amd_protected_guest_has(flag);
+
+	return false;
+}
+
+#else
+
+static inline bool protected_guest_has(unsigned long flag) { return false; }
+
+#endif
+
+#endif
-- 
2.25.1



* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-31 19:14                                   ` Borislav Petkov
  2021-06-01  2:07                                     ` [RFC v2-fix-v1 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-06-01 21:16                                     ` Kuppuswamy, Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-01 21:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tom Lendacky, Sean Christopherson, Dave Hansen, Andi Kleen,
	Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel, Brijesh Singh

Hi,

On 5/31/21 12:14 PM, Borislav Petkov wrote:
>> We can also use protected_guest_has(VM_VIRTIO_SECURE_FIX) or something
>> similar for this purpose. Andi, any comments?
> protected_guest_has() is enough for that - no need for two functions.
> 
>> IMHO, its better to use above generic config option in common header
>> file (linux/protected_guest.h). Any architecture that implements
>> protected guest feature can enable it. This will help is hide arch
>> specific config options in arch specific header file.
> You define empty function stubs for when the arch config option is not
> enabled. Everything else is unnecessary. When another architecture needs
> this, then another architecture will generalize it like it is usually
> done.

Please check the updated version in email titled "[RFC v2-fix-v2 1/1] x86:
Introduce generic protected guest abstraction".

We can continue the rest of the discussion in that email.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-01 21:14             ` [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction Kuppuswamy Sathyanarayanan
@ 2021-06-02 17:20               ` Sean Christopherson
  2021-06-02 18:15                 ` Tom Lendacky
  2021-06-02 18:19               ` Tom Lendacky
  2021-06-03 18:14               ` Borislav Petkov
  2 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-06-02 17:20 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Borislav Petkov, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel, Tom Lendacky

On Tue, Jun 01, 2021, Kuppuswamy Sathyanarayanan wrote:
> diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
> index 9c80c68d75b5..1492b0eb29d0 100644
> --- a/arch/x86/include/asm/mem_encrypt.h
> +++ b/arch/x86/include/asm/mem_encrypt.h
> @@ -56,6 +56,8 @@ bool sev_es_active(void);
>  
>  #define __bss_decrypted __section(".bss..decrypted")
>  
> +bool amd_protected_guest_has(unsigned long flag);


Why call one by the vendor (amd) and the other by the technology (tdx)?
sev_protected_guest_has() seems like the more logical name, e.g. if AMD CPUs
gain a new non-SEV technology then we'll have a mess.

> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index f0c1912837c8..cbfe7479f2a3 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -71,6 +71,8 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
>  		    struct tdx_hypercall_output *out);
>  
> +bool tdx_protected_guest_has(unsigned long flag);

...

> +static inline bool protected_guest_has(unsigned long flag)
> +{
> +	if (is_tdx_guest())
> +		return tdx_protected_guest_has(flag);
> +	else if (mem_encrypt_active())

Shouldn't this be sev_active()?  mem_encrypt_active() will return true for SME,
too.

> +		return amd_protected_guest_has(flag);
> +
> +	return false;
> +}

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-02 17:20               ` Sean Christopherson
@ 2021-06-02 18:15                 ` Tom Lendacky
  2021-06-02 18:25                   ` Kuppuswamy, Sathyanarayanan
  2021-06-02 18:29                   ` Borislav Petkov
  0 siblings, 2 replies; 381+ messages in thread
From: Tom Lendacky @ 2021-06-02 18:15 UTC (permalink / raw)
  To: Sean Christopherson, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Borislav Petkov, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On 6/2/21 12:20 PM, Sean Christopherson wrote:
> On Tue, Jun 01, 2021, Kuppuswamy Sathyanarayanan wrote:
>> diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
>> index 9c80c68d75b5..1492b0eb29d0 100644
>> --- a/arch/x86/include/asm/mem_encrypt.h
>> +++ b/arch/x86/include/asm/mem_encrypt.h
>> @@ -56,6 +56,8 @@ bool sev_es_active(void);
>>  
>>  #define __bss_decrypted __section(".bss..decrypted")
>>  
>> +bool amd_protected_guest_has(unsigned long flag);
> 
> 
> Why call one by the vendor (amd) and the other by the technology (tdx)?
> sev_protected_guest_has() seems like the more logical name, e.g. if AMD CPUs
> gain a new non-SEV technology then we'll have a mess.

The original suggestion from Boris, IIRC, was for protected_guest_has()
function (below) to be:

	if (intel)
		return intel_protected_guest_has();
	else if (amd)
		return amd_protected_guest_has();
	else
		return false;

And then you could check for TDX or SME/SEV in the respective functions.

> 
>> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
>> index f0c1912837c8..cbfe7479f2a3 100644
>> --- a/arch/x86/include/asm/tdx.h
>> +++ b/arch/x86/include/asm/tdx.h
>> @@ -71,6 +71,8 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>>  u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
>>  		    struct tdx_hypercall_output *out);
>>  
>> +bool tdx_protected_guest_has(unsigned long flag);
> 
> ...
> 
>> +static inline bool protected_guest_has(unsigned long flag)
>> +{
>> +	if (is_tdx_guest())
>> +		return tdx_protected_guest_has(flag);
>> +	else if (mem_encrypt_active())
> 
> Shouldn't this be sev_active()?  mem_encrypt_active() will return true for SME,
> too.

I believe Boris was wanting to replace the areas where sme_active() was
specifically checked, too. And so protected_guest_has() can be confusing...

Maybe naming it protected_os_has() or protection_attr_active() might work.
This would then work for SME or MKTME as well.

Thanks,
Tom

> 
>> +		return amd_protected_guest_has(flag);
>> +
>> +	return false;
>> +}

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-01 21:14             ` [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction Kuppuswamy Sathyanarayanan
  2021-06-02 17:20               ` Sean Christopherson
@ 2021-06-02 18:19               ` Tom Lendacky
  2021-06-02 18:29                 ` Kuppuswamy, Sathyanarayanan
  2021-06-02 18:30                 ` Borislav Petkov
  2021-06-03 18:14               ` Borislav Petkov
  2 siblings, 2 replies; 381+ messages in thread
From: Tom Lendacky @ 2021-06-02 18:19 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Borislav Petkov
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 6/1/21 4:14 PM, Kuppuswamy Sathyanarayanan wrote:
> Add a generic way to check if we are running in an encrypted guest,
> without requiring x86-specific ifdefs. This can then be used in
> non-architecture-specific code.
> 
> protected_guest_has() is used to check for protected guest
> feature flags.
> 
> Originally-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
> Changes since RFC v2-fix-v1:
>  * Changed the title from "tdx: Introduce generic protected_guest
>    abstraction" to "x86: Introduce generic protected guest"
>  * Removed usage of ARCH_HAS_PROTECTED_GUEST and directly called TDX
>    and AMD specific xx_protected_guest_has() variants from
>    linux/protected_guest.h.
>  * Added support for amd_protected_guest_has() helper function.
>  * Removed redundant is_tdx_guest() check in tdx_protected_guest_has()
>    function.
>  * Fixed commit log to reflect the latest changes.

...

>  
> +bool amd_protected_guest_has(unsigned long flag)
> +{
> +	switch (flag) {
> +	case VM_MEM_ENCRYPT:
> +	case VM_MEM_ENCRYPT_ACTIVE:
> +		return true;
> +	}
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(amd_protected_guest_has);

This certainly doesn't capture all of the situations where true would need
to be returned. For example, SEV, but not SEV-ES, requires that string I/O
be unrolled, etc.

Thanks,
Tom


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-02 18:15                 ` Tom Lendacky
@ 2021-06-02 18:25                   ` Kuppuswamy, Sathyanarayanan
  2021-06-02 18:29                   ` Borislav Petkov
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-02 18:25 UTC (permalink / raw)
  To: Tom Lendacky, Sean Christopherson
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Borislav Petkov, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel



On 6/2/21 11:15 AM, Tom Lendacky wrote:
> The original suggestion from Boris, IIRC, was for protected_guest_has()
> function (below) to be:
> 
> 	if (intel)
> 		return intel_protected_guest_has();

Yes. But for Intel, I think currently we can only check for is_tdx_guest() here.

if (is_tdx_guest())
	return intel_protected_guest_has();

So if we use is_tdx_guest(), it is better to call tdx_protected_guest_has() here.

Once we start using protected_guest_has() for other Intel technologies, maybe
we can generalize it. Let me know your comments.

> 	else if (amd)
> 		return amd_protected_guest_has();
> 	else
> 		return false;
> 
> And then you could check for TDX or SME/SEV in the respective functions.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-02 18:15                 ` Tom Lendacky
  2021-06-02 18:25                   ` Kuppuswamy, Sathyanarayanan
@ 2021-06-02 18:29                   ` Borislav Petkov
  2021-06-02 18:32                     ` Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-06-02 18:29 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Sean Christopherson, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, linux-kernel

On Wed, Jun 02, 2021 at 01:15:23PM -0500, Tom Lendacky wrote:
> The original suggestion from Boris, IIRC, was for protected_guest_has()
> function (below) to be:
> 
> 	if (intel)
> 		return intel_protected_guest_has();
> 	else if (amd)
> 		return amd_protected_guest_has();
> 	else
> 		return false;
> 
> And then you could check for TDX or SME/SEV in the respective functions.

Yeah, a single function call which calls vendor-specific functions.

If you can point me to a tree with your patches, I can try to hack up
what I mean.

> I believe Boris was wanting to replace the areas where sme_active() was
> specifically checked, too. And so protected_guest_has() can be confusing...

We can always say

	protected_guest_has(SME_ACTIVE);

or so and then it is clear.

> Maybe naming it protected_os_has() or protection_attr_active() might work.
> This would then work SME or MKTME as well.

But other names are fine too once we're done with the bikeshedding.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-02 18:19               ` Tom Lendacky
@ 2021-06-02 18:29                 ` Kuppuswamy, Sathyanarayanan
  2021-06-02 18:30                 ` Borislav Petkov
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-02 18:29 UTC (permalink / raw)
  To: Tom Lendacky, Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	Tony Luck, Borislav Petkov
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel



On 6/2/21 11:19 AM, Tom Lendacky wrote:
> This certainly doesn't capture all of the situations where true would need
> to be returned. For example, SEV, but not SEV-ES, requires that string I/O
> be unrolled, etc.

For AMD, the following cases should be true, right? I can fix it in the next version.

case VM_UNROLL_STRING_IO:
case VM_HOST_MEM_ENCRYPT:

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-02 18:19               ` Tom Lendacky
  2021-06-02 18:29                 ` Kuppuswamy, Sathyanarayanan
@ 2021-06-02 18:30                 ` Borislav Petkov
  1 sibling, 0 replies; 381+ messages in thread
From: Borislav Petkov @ 2021-06-02 18:30 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

On Wed, Jun 02, 2021 at 01:19:07PM -0500, Tom Lendacky wrote:
> This certainly doesn't capture all of the situations where true would need
> to be returned. For example, SEV, but not SEV-ES, requires that string I/O
> be unrolled, etc.

Yeah, I believe this would be better done by you guys, on top, as you
know best what needs to be queried where. So this first patch adding
only a stub should be fine. Or you or someone else does the conversion
on top of the Intel patch and then all patches go together.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-02 18:29                   ` Borislav Petkov
@ 2021-06-02 18:32                     ` Kuppuswamy, Sathyanarayanan
  2021-06-02 18:39                       ` Borislav Petkov
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-02 18:32 UTC (permalink / raw)
  To: Borislav Petkov, Tom Lendacky
  Cc: Sean Christopherson, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel



On 6/2/21 11:29 AM, Borislav Petkov wrote:
> If you can point me to a tree with your patches, I can try to hack up
> what I mean.

https://github.com/intel/tdx/commit/8515b66a0cb27d5ab66eda201285090faee742f7

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-02 18:32                     ` Kuppuswamy, Sathyanarayanan
@ 2021-06-02 18:39                       ` Borislav Petkov
  2021-06-02 18:45                         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-06-02 18:39 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Tom Lendacky, Sean Christopherson, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, linux-kernel

On Wed, Jun 02, 2021 at 11:32:18AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> 
> 
> On 6/2/21 11:29 AM, Borislav Petkov wrote:
> > If you can point me to a tree with your patches, I can try to hack up
> > what I mean.
> 
> https://github.com/intel/tdx/commit/8515b66a0cb27d5ab66eda201285090faee742f7

Ok, and which branch or tag?

tdx-guest-v5.12-7 or "guest"?

The github interface is yuck when one wants to look at commits...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-02 18:39                       ` Borislav Petkov
@ 2021-06-02 18:45                         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-02 18:45 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tom Lendacky, Sean Christopherson, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, linux-kernel



On 6/2/21 11:39 AM, Borislav Petkov wrote:
> Ok, and which branch or tag?
> 
> tdx-guest-v5.12-7 or "guest"?
> 
> The github interface is yuck when one wants to look at commits...

Please use the guest branch.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 0/2] x86/tdx: Handle in-kernel MMIO
  2021-05-07 21:52   ` Dave Hansen
  2021-05-18  0:48     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-06-02 19:42     ` Kuppuswamy Sathyanarayanan
  2021-06-02 19:42       ` [RFC v2-fix-v2 1/2] x86/sev-es: Abstract out MMIO instruction decoding Kuppuswamy Sathyanarayanan
  2021-06-02 19:42       ` [RFC v2-fix-v2 2/2] " Kuppuswamy Sathyanarayanan
  1 sibling, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-02 19:42 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson,
	Kuppuswamy Sathyanarayanan, linux-kernel

This patchset addresses the review comments in the patch titled
"x86/tdx: Handle in-kernel MMIO". Since it requires a patch
split, these are being sent together.

Changes since RFC v2-fix:
 * Introduced "x86/sev-es: Abstract out MMIO instruction
   decoding" patch for sharing common code between TDX
   and SEV.
 * Modified TDX MMIO code to utilize common shared functions.
 * Modified commit log to reflect latest changes and to
   address review comments.

Changes since RFC v2:
 * Fixed commit log as per Dave's review.

Kirill A. Shutemov (2):
  x86/sev-es: Abstract out MMIO instruction decoding
  x86/tdx: Handle in-kernel MMIO

 arch/x86/include/asm/insn-eval.h |  13 +++
 arch/x86/kernel/sev.c            | 171 ++++++++-----------------------
 arch/x86/kernel/tdx.c            | 108 +++++++++++++++++++
 arch/x86/lib/insn-eval.c         | 102 ++++++++++++++++++
 4 files changed, 263 insertions(+), 131 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 1/2] x86/sev-es: Abstract out MMIO instruction decoding
  2021-06-02 19:42     ` [RFC v2-fix-v2 0/2] " Kuppuswamy Sathyanarayanan
@ 2021-06-02 19:42       ` Kuppuswamy Sathyanarayanan
  2021-06-05 21:56         ` Dan Williams
  2021-06-02 19:42       ` [RFC v2-fix-v2 2/2] " Kuppuswamy Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-02 19:42 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson,
	Kuppuswamy Sathyanarayanan, linux-kernel

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For a regular virtual machine, MMIO is handled by the VMM: KVM
emulates the instruction that caused the MMIO access. But this
model doesn't work for secure VMs (like SEV or TDX) because the
VMM doesn't have access to the guest memory and register state.
The VMM needs assistance in handling MMIO: it induces an
exception in the guest, and the guest has to decode the
instruction and handle it on its own.

The instruction decoding logic is similar between the AMD SEV
and TDX code, so extract the decoding code to insn-eval.c where
it can be used by both SEV and TDX.

This patch adds no functional changes. It is only build-tested
for SEV.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/insn-eval.h |  13 +++
 arch/x86/kernel/sev.c            | 171 ++++++++-----------------------
 arch/x86/lib/insn-eval.c         | 102 ++++++++++++++++++
 3 files changed, 155 insertions(+), 131 deletions(-)

diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index 91d7182ad2d6..4a4ca7e7be66 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -19,6 +19,7 @@ bool insn_has_rep_prefix(struct insn *insn);
 void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_reg_off(struct insn *insn, struct pt_regs *regs);
+void *insn_get_modrm_reg_ptr(struct insn *insn, struct pt_regs *regs);
 unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx);
 int insn_get_code_seg_params(struct pt_regs *regs);
 int insn_fetch_from_user(struct pt_regs *regs,
@@ -28,4 +29,16 @@ int insn_fetch_from_user_inatomic(struct pt_regs *regs,
 bool insn_decode_from_regs(struct insn *insn, struct pt_regs *regs,
 			   unsigned char buf[MAX_INSN_SIZE], int buf_size);
 
+enum mmio_type {
+	MMIO_DECODE_FAILED,
+	MMIO_WRITE,
+	MMIO_WRITE_IMM,
+	MMIO_READ,
+	MMIO_READ_ZERO_EXTEND,
+	MMIO_READ_SIGN_EXTEND,
+	MMIO_MOVS,
+};
+
+enum mmio_type insn_decode_mmio(struct insn *insn, int *bytes);
+
 #endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 651b81cd648e..f7a743d122eb 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -789,22 +789,6 @@ static void __init vc_early_forward_exception(struct es_em_ctxt *ctxt)
 	do_early_exception(ctxt->regs, trapnr);
 }
 
-static long *vc_insn_get_reg(struct es_em_ctxt *ctxt)
-{
-	long *reg_array;
-	int offset;
-
-	reg_array = (long *)ctxt->regs;
-	offset    = insn_get_modrm_reg_off(&ctxt->insn, ctxt->regs);
-
-	if (offset < 0)
-		return NULL;
-
-	offset /= sizeof(long);
-
-	return reg_array + offset;
-}
-
 static long *vc_insn_get_rm(struct es_em_ctxt *ctxt)
 {
 	long *reg_array;
@@ -852,76 +836,6 @@ static enum es_result vc_do_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
 	return sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, exit_info_1, exit_info_2);
 }
 
-static enum es_result vc_handle_mmio_twobyte_ops(struct ghcb *ghcb,
-						 struct es_em_ctxt *ctxt)
-{
-	struct insn *insn = &ctxt->insn;
-	unsigned int bytes = 0;
-	enum es_result ret;
-	int sign_byte;
-	long *reg_data;
-
-	switch (insn->opcode.bytes[1]) {
-		/* MMIO Read w/ zero-extension */
-	case 0xb6:
-		bytes = 1;
-		fallthrough;
-	case 0xb7:
-		if (!bytes)
-			bytes = 2;
-
-		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
-		if (ret)
-			break;
-
-		/* Zero extend based on operand size */
-		reg_data = vc_insn_get_reg(ctxt);
-		if (!reg_data)
-			return ES_DECODE_FAILED;
-
-		memset(reg_data, 0, insn->opnd_bytes);
-
-		memcpy(reg_data, ghcb->shared_buffer, bytes);
-		break;
-
-		/* MMIO Read w/ sign-extension */
-	case 0xbe:
-		bytes = 1;
-		fallthrough;
-	case 0xbf:
-		if (!bytes)
-			bytes = 2;
-
-		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
-		if (ret)
-			break;
-
-		/* Sign extend based on operand size */
-		reg_data = vc_insn_get_reg(ctxt);
-		if (!reg_data)
-			return ES_DECODE_FAILED;
-
-		if (bytes == 1) {
-			u8 *val = (u8 *)ghcb->shared_buffer;
-
-			sign_byte = (*val & 0x80) ? 0xff : 0x00;
-		} else {
-			u16 *val = (u16 *)ghcb->shared_buffer;
-
-			sign_byte = (*val & 0x8000) ? 0xff : 0x00;
-		}
-		memset(reg_data, sign_byte, insn->opnd_bytes);
-
-		memcpy(reg_data, ghcb->shared_buffer, bytes);
-		break;
-
-	default:
-		ret = ES_UNSUPPORTED;
-	}
-
-	return ret;
-}
-
 /*
  * The MOVS instruction has two memory operands, which raises the
  * problem that it is not known whether the access to the source or the
@@ -989,83 +903,78 @@ static enum es_result vc_handle_mmio_movs(struct es_em_ctxt *ctxt,
 		return ES_RETRY;
 }
 
-static enum es_result vc_handle_mmio(struct ghcb *ghcb,
-				     struct es_em_ctxt *ctxt)
+static enum es_result vc_handle_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
 {
 	struct insn *insn = &ctxt->insn;
 	unsigned int bytes = 0;
+	enum mmio_type mmio;
 	enum es_result ret;
+	u8 sign_byte;
 	long *reg_data;
 
-	switch (insn->opcode.bytes[0]) {
-	/* MMIO Write */
-	case 0x88:
-		bytes = 1;
-		fallthrough;
-	case 0x89:
-		if (!bytes)
-			bytes = insn->opnd_bytes;
+	mmio = insn_decode_mmio(insn, &bytes);
+	if (mmio == MMIO_DECODE_FAILED)
+		return ES_DECODE_FAILED;
 
-		reg_data = vc_insn_get_reg(ctxt);
+	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+		reg_data = insn_get_modrm_reg_ptr(insn, ctxt->regs);
 		if (!reg_data)
 			return ES_DECODE_FAILED;
+	}
 
+	switch (mmio) {
+	case MMIO_WRITE:
 		memcpy(ghcb->shared_buffer, reg_data, bytes);
-
 		ret = vc_do_mmio(ghcb, ctxt, bytes, false);
 		break;
-
-	case 0xc6:
-		bytes = 1;
-		fallthrough;
-	case 0xc7:
-		if (!bytes)
-			bytes = insn->opnd_bytes;
-
+	case MMIO_WRITE_IMM:
 		memcpy(ghcb->shared_buffer, insn->immediate1.bytes, bytes);
-
 		ret = vc_do_mmio(ghcb, ctxt, bytes, false);
 		break;
-
-		/* MMIO Read */
-	case 0x8a:
-		bytes = 1;
-		fallthrough;
-	case 0x8b:
-		if (!bytes)
-			bytes = insn->opnd_bytes;
-
+	case MMIO_READ:
 		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
 		if (ret)
 			break;
 
-		reg_data = vc_insn_get_reg(ctxt);
-		if (!reg_data)
-			return ES_DECODE_FAILED;
-
 		/* Zero-extend for 32-bit operation */
 		if (bytes == 4)
 			*reg_data = 0;
 
 		memcpy(reg_data, ghcb->shared_buffer, bytes);
 		break;
+	case MMIO_READ_ZERO_EXTEND:
+		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+		if (ret)
+			break;
+
+		memset(reg_data, 0, insn->opnd_bytes);
+		memcpy(reg_data, ghcb->shared_buffer, bytes);
+		break;
+	case MMIO_READ_SIGN_EXTEND:
+		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+		if (ret)
+			break;
 
-		/* MOVS instruction */
-	case 0xa4:
-		bytes = 1;
-		fallthrough;
-	case 0xa5:
-		if (!bytes)
-			bytes = insn->opnd_bytes;
+		if (bytes == 1) {
+			u8 *val = (u8 *)ghcb->shared_buffer;
 
-		ret = vc_handle_mmio_movs(ctxt, bytes);
+			sign_byte = (*val & 0x80) ? 0xff : 0x00;
+		} else {
+			u16 *val = (u16 *)ghcb->shared_buffer;
+
+			sign_byte = (*val & 0x8000) ? 0xff : 0x00;
+		}
+
+		/* Sign extend based on operand size */
+		memset(reg_data, sign_byte, insn->opnd_bytes);
+		memcpy(reg_data, ghcb->shared_buffer, bytes);
 		break;
-		/* Two-Byte Opcodes */
-	case 0x0f:
-		ret = vc_handle_mmio_twobyte_ops(ghcb, ctxt);
+	case MMIO_MOVS:
+		ret = vc_handle_mmio_movs(ctxt, bytes);
 		break;
 	default:
 		ret = ES_UNSUPPORTED;
+		break;
 	}
 
 	return ret;
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index a67afd74232c..81e8fe0bdc39 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -850,6 +850,26 @@ int insn_get_modrm_reg_off(struct insn *insn, struct pt_regs *regs)
 	return get_reg_offset(insn, regs, REG_TYPE_REG);
 }
 
+/**
+ * insn_get_modrm_reg_ptr() - Obtain register pointer based on ModRM byte
+ * @insn:	Instruction containing the ModRM byte
+ * @regs:	Register values as seen when entering kernel mode
+ *
+ * Returns:
+ *
+ * The register indicated by the reg part of the ModRM byte.
+ * The register is obtained as a pointer within pt_regs.
+ */
+void *insn_get_modrm_reg_ptr(struct insn *insn, struct pt_regs *regs)
+{
+	int offset;
+
+	offset = insn_get_modrm_reg_off(insn, regs);
+	if (offset < 0)
+		return NULL;
+	return (void *)regs + offset;
+}
+
 /**
  * get_seg_base_limit() - obtain base address and limit of a segment
  * @insn:	Instruction. Must be valid.
@@ -1539,3 +1559,85 @@ bool insn_decode_from_regs(struct insn *insn, struct pt_regs *regs,
 
 	return true;
 }
+
+/**
+ * insn_decode_mmio() - Decode an MMIO instruction
+ * @insn:	Structure to store decoded instruction
+ * @bytes:	Returns the size of the memory operand
+ *
+ * Decodes an instruction used for memory-mapped I/O.
+ *
+ * Returns:
+ *
+ * The type of the instruction. The size of the memory operand is
+ * stored in @bytes. If decoding failed, MMIO_DECODE_FAILED is returned.
+ */
+enum mmio_type insn_decode_mmio(struct insn *insn, int *bytes)
+{
+	int type = MMIO_DECODE_FAILED;
+
+	*bytes = 0;
+
+	insn_get_opcode(insn);
+	switch (insn->opcode.bytes[0]) {
+	case 0x88: /* MOV m8,r8 */
+		*bytes = 1;
+		fallthrough;
+	case 0x89: /* MOV m16/m32/m64, r16/r32/r64 */
+		if (!*bytes)
+			*bytes = insn->opnd_bytes;
+		type = MMIO_WRITE;
+		break;
+
+	case 0xc6: /* MOV m8, imm8 */
+		*bytes = 1;
+		fallthrough;
+	case 0xc7: /* MOV m16/m32/m64, imm16/imm32/imm64 */
+		if (!*bytes)
+			*bytes = insn->opnd_bytes;
+		type = MMIO_WRITE_IMM;
+		break;
+
+	case 0x8a: /* MOV r8, m8 */
+		*bytes = 1;
+		fallthrough;
+	case 0x8b: /* MOV r16/r32/r64, m16/m32/m64 */
+		if (!*bytes)
+			*bytes = insn->opnd_bytes;
+		type = MMIO_READ;
+		break;
+
+	case 0xa4: /* MOVS m8, m8 */
+		*bytes = 1;
+		fallthrough;
+	case 0xa5: /* MOVS m16/m32/m64, m16/m32/m64 */
+		if (!*bytes)
+			*bytes = insn->opnd_bytes;
+		type = MMIO_MOVS;
+		break;
+
+	case 0x0f: /* Two-byte instruction */
+		switch (insn->opcode.bytes[1]) {
+		case 0xb6: /* MOVZX r16/r32/r64, m8 */
+			*bytes = 1;
+			fallthrough;
+		case 0xb7: /* MOVZX r32/r64, m16 */
+			if (!*bytes)
+				*bytes = 2;
+			type = MMIO_READ_ZERO_EXTEND;
+			break;
+
+		case 0xbe: /* MOVSX r16/r32/r64, m8 */
+			*bytes = 1;
+			fallthrough;
+		case 0xbf: /* MOVSX r32/r64, m16 */
+			if (!*bytes)
+				*bytes = 2;
+			type = MMIO_READ_SIGN_EXTEND;
+			break;
+		}
+		break;
+	}
+
+	return type;
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 2/2] x86/tdx: Handle in-kernel MMIO
  2021-06-02 19:42     ` [RFC v2-fix-v2 0/2] " Kuppuswamy Sathyanarayanan
  2021-06-02 19:42       ` [RFC v2-fix-v2 1/2] x86/sev-es: Abstract out MMIO instruction decoding Kuppuswamy Sathyanarayanan
@ 2021-06-02 19:42       ` Kuppuswamy Sathyanarayanan
  2021-06-02 21:01         ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-02 19:42 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson,
	Kuppuswamy Sathyanarayanan, linux-kernel

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

In traditional VMs, MMIO is usually implemented by giving the
guest access to a mapping that causes a VMEXIT on access, with
the VMM then emulating the access. That's not possible in a TDX
guest because a VMEXIT would expose the register state to the
host. TDX guests don't trust the host and can't have their state
exposed to it. In TDX, MMIO regions are instead configured to
trigger a #VE exception in the guest. The guest #VE handler then
emulates the MMIO instruction inside the guest and converts it
into a controlled TDCALL to the host, rather than exposing the
full register state.

Currently, MMIO is only supported for instructions that are known
to come from the io.h macros (build_mmio_read/write()). Drivers
that don't use the io.h macros or that use structure overlays to
do MMIO are currently not supported in TDX guests (for example,
the MMIO-based XAPIC is disabled at runtime for TDX). User-space
access triggers SIGBUS.

This way of handling MMIO is similar to AMD SEV.

The reasons for supporting #VE based MMIO in TDX guests are:

* MMIO is widely used and we'll have more drivers in the future.
* We don't want to annotate every TDX-specific MMIO readl/writel etc.
* Without annotations we would need to add an alternative to every
  MMIO access in the kernel (even though 99.9% will never be used on
  TDX), which would be a complete waste and incredible binary bloat
  for nothing.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 108 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index d0e569b607bc..3687144b9131 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -5,6 +5,9 @@
 
 #include <asm/tdx.h>
 #include <asm/vmx.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
+#include <linux/sched/signal.h> /* force_sig_fault() */
 
 #include <linux/cpu.h>
 #include <linux/protected_guest.h>
@@ -236,6 +239,104 @@ static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
 	}
 }
 
+static unsigned long tdg_mmio(int size, bool write, unsigned long addr,
+		unsigned long *val)
+{
+	struct tdx_hypercall_output out = {0};
+	u64 err;
+
+	err = __tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
+			      addr, *val, &out);
+	*val = out.r11;
+	return err;
+}
+
+static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+	struct insn insn = {};
+	char buffer[MAX_INSN_SIZE];
+	enum mmio_type mmio;
+	unsigned long *reg;
+	int size, ret;
+	u8 sign_byte;
+	unsigned long val;
+
+	if (user_mode(regs)) {
+		ret = insn_fetch_from_user(regs, buffer);
+		if (!ret)
+			return -EFAULT;
+		if (!insn_decode_from_regs(&insn, regs, buffer, ret))
+			return -EFAULT;
+	} else {
+		ret = copy_from_kernel_nofault(buffer, (void *)regs->ip,
+					       MAX_INSN_SIZE);
+		if (ret)
+			return -EFAULT;
+		insn_init(&insn, buffer, MAX_INSN_SIZE, 1);
+		insn_get_length(&insn);
+	}
+
+	mmio = insn_decode_mmio(&insn, &size);
+	if (mmio == MMIO_DECODE_FAILED)
+		return -EFAULT;
+
+	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+		reg = insn_get_modrm_reg_ptr(&insn, regs);
+		if (!reg)
+			return -EFAULT;
+	}
+
+	switch (mmio) {
+	case MMIO_WRITE:
+		memcpy(&val, reg, size);
+		ret = tdg_mmio(size, true, ve->gpa, &val);
+		break;
+	case MMIO_WRITE_IMM:
+		val = insn.immediate.value;
+		ret = tdg_mmio(size, true, ve->gpa, &val);
+		break;
+	case MMIO_READ:
+		ret = tdg_mmio(size, false, ve->gpa, &val);
+		if (ret)
+			break;
+		/* Zero-extend for 32-bit operation */
+		if (size == 4)
+			*reg = 0;
+		memcpy(reg, &val, size);
+		break;
+	case MMIO_READ_ZERO_EXTEND:
+		ret = tdg_mmio(size, false, ve->gpa, &val);
+		if (ret)
+			break;
+
+		/* Zero extend based on operand size */
+		memset(reg, 0, insn.opnd_bytes);
+		memcpy(reg, &val, size);
+		break;
+	case MMIO_READ_SIGN_EXTEND:
+		ret = tdg_mmio(size, false, ve->gpa, &val);
+		if (ret)
+			break;
+
+		if (size == 1)
+			sign_byte = (val & 0x80) ? 0xff : 0x00;
+		else
+			sign_byte = (val & 0x8000) ? 0xff : 0x00;
+
+		/* Sign extend based on operand size */
+		memset(reg, sign_byte, insn.opnd_bytes);
+		memcpy(reg, &val, size);
+		break;
+	case MMIO_MOVS:
+	case MMIO_DECODE_FAILED:
+		return -EFAULT;
+	}
+
+	if (ret)
+		return -EFAULT;
+	return insn.length;
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -285,6 +386,13 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_IO_INSTRUCTION:
 		tdg_handle_io(regs, ve->exit_qual);
 		break;
+	case EXIT_REASON_EPT_VIOLATION:
+		ve->instr_len = tdg_handle_mmio(regs, ve);
+		if (ve->instr_len < 0) {
+			pr_warn_once("MMIO failed\n");
+			return -EFAULT;
+		}
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Handle in-kernel MMIO
  2021-06-02 19:42       ` [RFC v2-fix-v2 2/2] " Kuppuswamy Sathyanarayanan
@ 2021-06-02 21:01         ` Andi Kleen
  2021-06-02 22:14           ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-06-02 21:01 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel

User-space
> access triggers SIGBUS.

Actually it looks like it's implemented below now, so that sentence 
could be dropped.


> +
> +	if (user_mode(regs)) {
> +		ret = insn_fetch_from_user(regs, buffer);
> +		if (!ret)
> +			return -EFAULT;
> +		if (!insn_decode_from_regs(&insn, regs, buffer, ret))
> +			return -EFAULT;
> +	} else {
> +		ret = copy_from_kernel_nofault(buffer, (void *)regs->ip,
> +					       MAX_INSN_SIZE);
> +		if (ret)
> +			return -EFAULT;
> +		insn_init(&insn, buffer, MAX_INSN_SIZE, 1);
> +		insn_get_length(&insn);
> +	}
> +

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 2/2] x86/tdx: Handle in-kernel MMIO
  2021-06-02 21:01         ` Andi Kleen
@ 2021-06-02 22:14           ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-02 22:14 UTC (permalink / raw)
  To: Andi Kleen, Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, Sean Christopherson, linux-kernel



On 6/2/21 2:01 PM, Andi Kleen wrote:
> Actually it looks like it's implemented below now, so that sentence could be dropped.

Will fix it in next version.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-01 21:14             ` [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction Kuppuswamy Sathyanarayanan
  2021-06-02 17:20               ` Sean Christopherson
  2021-06-02 18:19               ` Tom Lendacky
@ 2021-06-03 18:14               ` Borislav Petkov
  2021-06-03 18:15                 ` [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn Borislav Petkov
                                   ` (2 more replies)
  2 siblings, 3 replies; 381+ messages in thread
From: Borislav Petkov @ 2021-06-03 18:14 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Tom Lendacky

On Tue, Jun 01, 2021 at 02:14:17PM -0700, Kuppuswamy Sathyanarayanan wrote:
> diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
> new file mode 100644
> index 000000000000..303dfba81d52
> --- /dev/null
> +++ b/include/linux/protected_guest.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef _LINUX_PROTECTED_GUEST_H
> +#define _LINUX_PROTECTED_GUEST_H 1
> +
> +#include <linux/mem_encrypt.h>
> +
> +/* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */
> +
> +/* Support for guest encryption */
> +#define VM_MEM_ENCRYPT			0x100
> +/* Encryption support is active */
> +#define VM_MEM_ENCRYPT_ACTIVE		0x101
> +/* Support for unrolled string IO */
> +#define VM_UNROLL_STRING_IO		0x102
> +/* Support for host memory encryption */
> +#define VM_HOST_MEM_ENCRYPT		0x103
> +/* Support for shared mapping initialization (after early init) */
> +#define VM_SHARED_MAPPING_INIT		0x104

Ok, a couple of things:

first of all, those flags with that VM_ prefix make me think of
"virtual memory" instead of "virtual machine". So they should be
something else, like, say

PR_G_... for Protected Guest or so. Or PR_GUEST or ...

(yeah, good namespaces are all taken.)

Then, about the function name length, I'm fine if we did:

	prot_guest_has()

or something even shorter, if you folks have a good suggestion.

Anyway, below is a diff on top of your tree with what I think the
barebones of this should be.

As a reply to this message I went and converted sme_active() to use
protected_guest_has() too.

Comments, complaints?

Thx.

---
diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index 1492b0eb29d0..9c80c68d75b5 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -56,8 +56,6 @@ bool sev_es_active(void);
 
 #define __bss_decrypted __section(".bss..decrypted")
 
-bool amd_protected_guest_has(unsigned long flag);
-
 #else	/* !CONFIG_AMD_MEM_ENCRYPT */
 
 #define sme_me_mask	0ULL
@@ -88,8 +86,6 @@ early_set_memory_encrypted(unsigned long vaddr, unsigned long size) { return 0;
 
 static inline void mem_encrypt_free_decrypted_mem(void) { }
 
-static inline bool amd_protected_guest_has(unsigned long flag) { return false; }
-
 #define __bss_decrypted
 
 #endif	/* CONFIG_AMD_MEM_ENCRYPT */
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index fa5cd05d3b5b..f09996c6a272 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -11,6 +11,7 @@
 #include <linux/types.h>
 #include <asm/insn.h>
 #include <asm/sev-common.h>
+#include <asm/pgtable_types.h>
 
 #define GHCB_PROTO_OUR		0x0001UL
 #define GHCB_PROTOCOL_MAX	1ULL
@@ -81,12 +82,15 @@ static __always_inline void sev_es_nmi_complete(void)
 		__sev_es_nmi_complete();
 }
 extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
+bool sev_protected_guest_has(unsigned long flag);
+
 #else
 static inline void sev_es_ist_enter(struct pt_regs *regs) { }
 static inline void sev_es_ist_exit(void) { }
 static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
 static inline void sev_es_nmi_complete(void) { }
 static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
+static inline bool sev_protected_guest_has(unsigned long flag) { return false; }
 #endif
 
-#endif
+#endif /* __ASM_ENCRYPTED_STATE_H */
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index f7a743d122eb..01a224fdb897 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -1402,3 +1402,14 @@ bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
 	while (true)
 		halt();
 }
+
+bool sev_protected_guest_has(unsigned long flag)
+{
+	switch (flag) {
+	case VM_MEM_ENCRYPT:
+	case VM_MEM_ENCRYPT_ACTIVE:
+		return true;
+	}
+
+	return false;
+}
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index ced658e79753..49d11bb6e02a 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -391,18 +391,6 @@ bool noinstr sev_es_active(void)
 	return sev_status & MSR_AMD64_SEV_ES_ENABLED;
 }
 
-bool amd_protected_guest_has(unsigned long flag)
-{
-	switch (flag) {
-	case VM_MEM_ENCRYPT:
-	case VM_MEM_ENCRYPT_ACTIVE:
-		return true;
-	}
-
-	return false;
-}
-EXPORT_SYMBOL_GPL(amd_protected_guest_has);
-
 /* Override for DMA direct allocation check - AMD specific initialization */
 bool amd_force_dma_unencrypted(struct device *dev)
 {
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
index 6855d5b3e244..bb4b1a06b21f 100644
--- a/include/linux/protected_guest.h
+++ b/include/linux/protected_guest.h
@@ -2,7 +2,9 @@
 #ifndef _LINUX_PROTECTED_GUEST_H
 #define _LINUX_PROTECTED_GUEST_H 1
 
-#include <linux/mem_encrypt.h>
+#include <asm/processor.h>
+#include <asm/tdx.h>
+#include <asm/sev.h>
 
 /* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */
 
@@ -20,23 +22,18 @@
 #define VM_DISABLE_UNCORE_SUPPORT	0x105
 
 #if defined(CONFIG_INTEL_TDX_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
-
-#include <asm/tdx.h>
-
 static inline bool protected_guest_has(unsigned long flag)
 {
 	if (is_tdx_guest())
 		return tdx_protected_guest_has(flag);
-	else if (mem_encrypt_active())
-		return amd_protected_guest_has(flag);
+	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+		return sev_protected_guest_has(flag);
 
 	return false;
 }
 
 #else
-
 static inline bool protected_guest_has(unsigned long flag) { return false; }
-
 #endif
 
-#endif
+#endif /* _LINUX_PROTECTED_GUEST_H */


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn
  2021-06-03 18:14               ` Borislav Petkov
@ 2021-06-03 18:15                 ` Borislav Petkov
  2021-06-04 22:01                   ` Tom Lendacky
  2021-06-07 19:55                   ` Kirill A. Shutemov
  2021-06-03 18:33                 ` [RFC v2-fix-v2 " Kuppuswamy, Sathyanarayanan
  2021-06-07 18:01                 ` Kuppuswamy, Sathyanarayanan
  2 siblings, 2 replies; 381+ messages in thread
From: Borislav Petkov @ 2021-06-03 18:15 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Tom Lendacky

From f1e9f051c86b09fe660f49b0307bc7c6cec5e6f4 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp@suse.de>
Date: Thu, 3 Jun 2021 20:03:31 +0200
Subject: Convert sme_active()

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index 9c80c68d75b5..1bb9f22629fc 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -50,7 +50,6 @@ void __init mem_encrypt_free_decrypted_mem(void);
 void __init mem_encrypt_init(void);
 
 void __init sev_es_init_vc_handling(void);
-bool sme_active(void);
 bool sev_active(void);
 bool sev_es_active(void);
 
@@ -75,7 +74,6 @@ static inline void __init sme_encrypt_kernel(struct boot_params *bp) { }
 static inline void __init sme_enable(struct boot_params *bp) { }
 
 static inline void sev_es_init_vc_handling(void) { }
-static inline bool sme_active(void) { return false; }
 static inline bool sev_active(void) { return false; }
 static inline bool sev_es_active(void) { return false; }
 
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index c078b0d3ab0e..1d88232146ab 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -387,7 +387,7 @@ void machine_kexec(struct kimage *image)
 				       (unsigned long)page_list,
 				       image->start,
 				       image->preserve_context,
-				       sme_active());
+				       protected_guest_has(VM_HOST_MEM_ENCRYPT));
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..ce6f2b9a05c7 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -49,7 +49,7 @@ int __init pci_swiotlb_detect_4gb(void)
 	 * buffers are allocated and used for devices that do not support
 	 * the addressing range required for the encryption mask.
 	 */
-	if (sme_active())
+	if (protected_guest_has(VM_HOST_MEM_ENCRYPT))
 		swiotlb = 1;
 
 	return swiotlb;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 01a224fdb897..3aa2658ced52 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -1409,6 +1409,11 @@ bool sev_protected_guest_has(unsigned long flag)
 	case VM_MEM_ENCRYPT:
 	case VM_MEM_ENCRYPT_ACTIVE:
 		return true;
+	case VM_HOST_MEM_ENCRYPT:
+		return sme_me_mask && !sev_active();
+	default:
+		WARN_ON_ONCE(1);
+		return false;
 	}
 
 	return false;
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 667bba74e4c8..50ed2a768844 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -703,7 +703,7 @@ bool arch_memremap_can_ram_remap(resource_size_t phys_addr, unsigned long size,
 	if (flags & MEMREMAP_DEC)
 		return false;
 
-	if (sme_active()) {
+	if (protected_guest_has(VM_HOST_MEM_ENCRYPT)) {
 		if (memremap_is_setup_data(phys_addr, size) ||
 		    memremap_is_efi_data(phys_addr, size))
 			return false;
@@ -729,7 +729,7 @@ pgprot_t __init early_memremap_pgprot_adjust(resource_size_t phys_addr,
 
 	encrypted_prot = true;
 
-	if (sme_active()) {
+	if (protected_guest_has(VM_HOST_MEM_ENCRYPT)) {
 		if (early_memremap_is_setup_data(phys_addr, size) ||
 		    memremap_is_efi_data(phys_addr, size))
 			encrypted_prot = false;
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 49d11bb6e02a..9b0cdac895ca 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -145,7 +145,7 @@ void __init sme_unmap_bootdata(char *real_mode_data)
 	struct boot_params *boot_data;
 	unsigned long cmdline_paddr;
 
-	if (!sme_active())
+	if (!protected_guest_has(VM_HOST_MEM_ENCRYPT))
 		return;
 
 	/* Get the command line address before unmapping the real_mode_data */
@@ -165,7 +165,7 @@ void __init sme_map_bootdata(char *real_mode_data)
 	struct boot_params *boot_data;
 	unsigned long cmdline_paddr;
 
-	if (!sme_active())
+	if (!protected_guest_has(VM_HOST_MEM_ENCRYPT))
 		return;
 
 	__sme_early_map_unmap_mem(real_mode_data, sizeof(boot_params), true);
@@ -365,7 +365,7 @@ int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size)
 /*
  * SME and SEV are very similar but they are not the same, so there are
  * times that the kernel will need to distinguish between SME and SEV. The
- * sme_active() and sev_active() functions are used for this.  When a
+ * protected_guest_has(VM_HOST_MEM_ENCRYPT) and sev_active() functions are used for this.  When a
  * distinction isn't needed, the mem_encrypt_active() function can be used.
  *
  * The trampoline code is a good example for this requirement.  Before
@@ -378,11 +378,6 @@ bool sev_active(void)
 {
 	return sev_status & MSR_AMD64_SEV_ENABLED;
 }
-
-bool sme_active(void)
-{
-	return sme_me_mask && !sev_active();
-}
 EXPORT_SYMBOL_GPL(sev_active);
 
 /* Needs to be called from non-instrumentable code */
@@ -405,7 +400,7 @@ bool amd_force_dma_unencrypted(struct device *dev)
 	 * device does not support DMA to addresses that include the
 	 * encryption mask.
 	 */
-	if (sme_active()) {
+	if (protected_guest_has(VM_HOST_MEM_ENCRYPT)) {
 		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
 		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
 						dev->bus_dma_limit);
@@ -446,7 +441,7 @@ static void print_mem_encrypt_feature_info(void)
 	pr_info("AMD Memory Encryption Features active:");
 
 	/* Secure Memory Encryption */
-	if (sme_active()) {
+	if (protected_guest_has(VM_HOST_MEM_ENCRYPT)) {
 		/*
 		 * SME is mutually exclusive with any of the SEV
 		 * features below.
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index da94fc2e9b56..286357956762 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -15,7 +15,7 @@
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
 {
-	if (sev_active() || sme_active())
+	if (sev_active() || protected_guest_has(VM_HOST_MEM_ENCRYPT))
 		return amd_force_dma_unencrypted(dev);
 
 	if (protected_guest_has(VM_MEM_ENCRYPT))
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index a9639f663d25..a92b49aa0d73 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -30,6 +30,7 @@
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/mem_encrypt.h>
+#include <linux/protected_guest.h>
 
 #include <asm/setup.h>
 #include <asm/sections.h>
@@ -287,7 +288,7 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
 	unsigned long pgtable_area_len;
 	unsigned long decrypted_base;
 
-	if (!sme_active())
+	if (!protected_guest_has(VM_HOST_MEM_ENCRYPT))
 		return;
 
 	/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 2e1c1bec0f9e..7f9a708986a3 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -42,7 +42,7 @@ void __init reserve_real_mode(void)
 static void sme_sev_setup_real_mode(struct trampoline_header *th)
 {
 #ifdef CONFIG_AMD_MEM_ENCRYPT
-	if (sme_active())
+	if (protected_guest_has(VM_HOST_MEM_ENCRYPT))
 		th->flags |= TH_FLAGS_SME_ACTIVE;
 
 	if (sev_es_active()) {
@@ -79,7 +79,7 @@ static void __init setup_real_mode(void)
 	 * decrypted memory in order to bring up other processors
 	 * successfully. This is not needed for SEV.
 	 */
-	if (sme_active())
+	if (protected_guest_has(VM_HOST_MEM_ENCRYPT))
 		set_memory_decrypted((unsigned long)base, size >> PAGE_SHIFT);
 
 	memcpy(base, real_mode_blob, size);
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index d006724f4dc2..3c2365f13cc3 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -965,7 +965,7 @@ static bool copy_device_table(void)
 		pr_err("The address of old device table is above 4G, not trustworthy!\n");
 		return false;
 	}
-	old_devtb = (sme_active() && is_kdump_kernel())
+	old_devtb = (protected_guest_has(VM_HOST_MEM_ENCRYPT) && is_kdump_kernel())
 		    ? (__force void *)ioremap_encrypted(old_devtb_phys,
 							dev_table_size)
 		    : memremap(old_devtb_phys, dev_table_size, MEMREMAP_WB);
@@ -3022,7 +3022,7 @@ static int __init amd_iommu_init(void)
 
 static bool amd_iommu_sme_check(void)
 {
-	if (!sme_active() || (boot_cpu_data.x86 != 0x17))
+	if (!protected_guest_has(VM_HOST_MEM_ENCRYPT) || (boot_cpu_data.x86 != 0x17))
 		return true;
 
 	/* For Fam17h, a specific level of support is required */

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-03 18:14               ` Borislav Petkov
  2021-06-03 18:15                 ` [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn Borislav Petkov
@ 2021-06-03 18:33                 ` Kuppuswamy, Sathyanarayanan
  2021-06-03 18:41                   ` Borislav Petkov
  2021-06-07 18:01                 ` Kuppuswamy, Sathyanarayanan
  2 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-03 18:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Tom Lendacky



On 6/3/21 11:14 AM, Borislav Petkov wrote:
> On Tue, Jun 01, 2021 at 02:14:17PM -0700, Kuppuswamy Sathyanarayanan wrote:
>> diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
>> new file mode 100644
>> index 000000000000..303dfba81d52
>> --- /dev/null
>> +++ b/include/linux/protected_guest.h
>> @@ -0,0 +1,40 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +#ifndef _LINUX_PROTECTED_GUEST_H
>> +#define _LINUX_PROTECTED_GUEST_H 1
>> +
>> +#include <linux/mem_encrypt.h>
>> +
>> +/* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */
>> +
>> +/* Support for guest encryption */
>> +#define VM_MEM_ENCRYPT			0x100
>> +/* Encryption support is active */
>> +#define VM_MEM_ENCRYPT_ACTIVE		0x101
>> +/* Support for unrolled string IO */
>> +#define VM_UNROLL_STRING_IO		0x102
>> +/* Support for host memory encryption */
>> +#define VM_HOST_MEM_ENCRYPT		0x103
>> +/* Support for shared mapping initialization (after early init) */
>> +#define VM_SHARED_MAPPING_INIT		0x104
> 
> Ok, a couple of things:
> 
> first of all, those flags with that VM_ prefix make me think of
> "virtual memory" instead of "virtual machine". So they should be
> something else, like, say
> 
> PR_G_... for Protected Guest or so. Or PR_GUEST or ...

I would prefer PR_GUEST over PR_G_

> 
> (yeah, good namespaces are all taken. )
> 
> Then, about the function name length, I'm fine if we did:
> 
> 	prot_guest_has()
> 
> or something even shorter, if you folks have a good suggestion.
> 
> Anyway, below is a diff ontop of your tree with what I think the
> barebones of this should be.
> 
> As a reply to this message I went and converted sme_active() to use
> protected_guest_has() too.
> 
> Comments, complaints?
> 
> Thx.
> 
> ---
> diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
> index 1492b0eb29d0..9c80c68d75b5 100644
> --- a/arch/x86/include/asm/mem_encrypt.h
> +++ b/arch/x86/include/asm/mem_encrypt.h
> @@ -56,8 +56,6 @@ bool sev_es_active(void);
>   
>   #define __bss_decrypted __section(".bss..decrypted")
>   
> -bool amd_protected_guest_has(unsigned long flag);
> -
>   #else	/* !CONFIG_AMD_MEM_ENCRYPT */
>   
>   #define sme_me_mask	0ULL
> @@ -88,8 +86,6 @@ early_set_memory_encrypted(unsigned long vaddr, unsigned long size) { return 0;
>   
>   static inline void mem_encrypt_free_decrypted_mem(void) { }
>   
> -static inline bool amd_protected_guest_has(unsigned long flag) { return false; }
> -
>   #define __bss_decrypted
>   
>   #endif	/* CONFIG_AMD_MEM_ENCRYPT */
> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> index fa5cd05d3b5b..f09996c6a272 100644
> --- a/arch/x86/include/asm/sev.h
> +++ b/arch/x86/include/asm/sev.h
> @@ -11,6 +11,7 @@
>   #include <linux/types.h>
>   #include <asm/insn.h>
>   #include <asm/sev-common.h>
> +#include <asm/pgtable_types.h>
>   
>   #define GHCB_PROTO_OUR		0x0001UL
>   #define GHCB_PROTOCOL_MAX	1ULL
> @@ -81,12 +82,15 @@ static __always_inline void sev_es_nmi_complete(void)
>   		__sev_es_nmi_complete();
>   }
>   extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
> +bool sev_protected_guest_has(unsigned long flag);
> +
>   #else
>   static inline void sev_es_ist_enter(struct pt_regs *regs) { }
>   static inline void sev_es_ist_exit(void) { }
>   static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
>   static inline void sev_es_nmi_complete(void) { }
>   static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
> +static inline bool sev_protected_guest_has(unsigned long flag) { return false; }
>   #endif
>   
> -#endif
> +#endif /* __ASM_ENCRYPTED_STATE_H */
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index f7a743d122eb..01a224fdb897 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -1402,3 +1402,14 @@ bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
>   	while (true)
>   		halt();
>   }
> +
> +bool sev_protected_guest_has(unsigned long flag)
> +{
> +	switch (flag) {
> +	case VM_MEM_ENCRYPT:
> +	case VM_MEM_ENCRYPT_ACTIVE:
> +		return true;
> +	}
> +
> +	return false;
> +}

I assume this file will get compiled for both SEV and SME cases.

> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index ced658e79753..49d11bb6e02a 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -391,18 +391,6 @@ bool noinstr sev_es_active(void)
>   	return sev_status & MSR_AMD64_SEV_ES_ENABLED;
>   }
>   
> -bool amd_protected_guest_has(unsigned long flag)
> -{
> -	switch (flag) {
> -	case VM_MEM_ENCRYPT:
> -	case VM_MEM_ENCRYPT_ACTIVE:
> -		return true;
> -	}
> -
> -	return false;
> -}
> -EXPORT_SYMBOL_GPL(amd_protected_guest_has);
> -
>   /* Override for DMA direct allocation check - AMD specific initialization */
>   bool amd_force_dma_unencrypted(struct device *dev)
>   {
> diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
> index 6855d5b3e244..bb4b1a06b21f 100644
> --- a/include/linux/protected_guest.h
> +++ b/include/linux/protected_guest.h
> @@ -2,7 +2,9 @@
>   #ifndef _LINUX_PROTECTED_GUEST_H
>   #define _LINUX_PROTECTED_GUEST_H 1
>   
> -#include <linux/mem_encrypt.h>
> +#include <asm/processor.h>
> +#include <asm/tdx.h>
> +#include <asm/sev.h>
>   
>   /* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */
>   
> @@ -20,23 +22,18 @@
>   #define VM_DISABLE_UNCORE_SUPPORT	0x105
>   
>   #if defined(CONFIG_INTEL_TDX_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
> -
> -#include <asm/tdx.h>
> -
>   static inline bool protected_guest_has(unsigned long flag)
>   {
>   	if (is_tdx_guest())
>   		return tdx_protected_guest_has(flag);
> -	else if (mem_encrypt_active())
> -		return amd_protected_guest_has(flag);
> +	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
> +		return sev_protected_guest_has(flag);

Since you are checking for AMD vendor ID, why not use amd_protected_guest_has()?

>   
>   	return false;
>   }
>   
>   #else
> -
>   static inline bool protected_guest_has(unsigned long flag) { return false; }
> -
>   #endif
>   
> -#endif
> +#endif /* _LINUX_PROTECTED_GUEST_H */
> 
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-03 18:33                 ` [RFC v2-fix-v2 " Kuppuswamy, Sathyanarayanan
@ 2021-06-03 18:41                   ` Borislav Petkov
  2021-06-03 18:54                     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-06-03 18:41 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Tom Lendacky

Sathya,

please trim your mails when you reply, like I've done in this reply.

Thx.

On Thu, Jun 03, 2021 at 11:33:53AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> I assume this file will get compiled for both SEV and SME cases.

Yap.

> Since you are checking for AMD vendor ID, why not use amd_protected_guest_has()?

Because, as Sean already told you, we should either stick to the
technologies: TDX or SEV or to the vendors: Intel or AMD - but not
either or.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-03 18:41                   ` Borislav Petkov
@ 2021-06-03 18:54                     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-03 18:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Tom Lendacky



On 6/3/21 11:41 AM, Borislav Petkov wrote:
>> Since you are checking for AMD vendor ID, why not use amd_protected_guest_has()?
> Because, as Sean already told you, we should either stick to the
> technologies: TDX or SEV or to the vendors: Intel or AMD - but not
> either or.

Ok. We can go with technologies for now. In the future, if protected_guest_has() is extended
to other technologies like MKTME, we can generalize it based on vendor.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn
  2021-06-03 18:15                 ` [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn Borislav Petkov
@ 2021-06-04 22:01                   ` Tom Lendacky
  2021-06-04 22:13                     ` Kuppuswamy, Sathyanarayanan
  2021-06-04 22:15                     ` Borislav Petkov
  2021-06-07 19:55                   ` Kirill A. Shutemov
  1 sibling, 2 replies; 381+ messages in thread
From: Tom Lendacky @ 2021-06-04 22:01 UTC (permalink / raw)
  To: Borislav Petkov, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 6/3/21 1:15 PM, Borislav Petkov wrote:
> From f1e9f051c86b09fe660f49b0307bc7c6cec5e6f4 Mon Sep 17 00:00:00 2001
> From: Borislav Petkov <bp@suse.de>
> Date: Thu, 3 Jun 2021 20:03:31 +0200
> Subject: Convert sme_active()
> 
>  	 */
> -	if (sme_active())
> +	if (protected_guest_has(VM_HOST_MEM_ENCRYPT))
>  		swiotlb = 1;

I still feel this is confusing. SME is a host/bare-metal technology, so
calling protected_guest_has() seems odd and using VM_HOST_MEM_ENCRYPT,
where I assume VM is short for virtual machine, also seems odd.

How about just protected_os_has()? Then you could have
- HOST_MEM_ENCRYPT  for host memory encryption
- GUEST_MEM_ENCRYPT for guest memory encryption
- MEM_ENCRYPT       for either host or guest memory encryption.

The first is analogous to sme_active(), the second to sev_active() and the
third to mem_encrypt_active(). Just my opinion, though...

>  
>  	return swiotlb;
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 01a224fdb897..3aa2658ced52 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -1409,6 +1409,11 @@ bool sev_protected_guest_has(unsigned long flag)
>  	case VM_MEM_ENCRYPT:
>  	case VM_MEM_ENCRYPT_ACTIVE:
>  		return true;
> +	case VM_HOST_MEM_ENCRYPT:
> +		return sme_me_mask && !sev_active();
> +	default:
> +		WARN_ON_ONCE(1);
> +		return false;

I don't think you want a WARN_ON_ONCE() here. The code will be written to
work with either SEV or TDX, so we shouldn't warn on a check for a TDX
supported feature when running on AMD (or vice-versa).

Thanks,
Tom


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn
  2021-06-04 22:01                   ` Tom Lendacky
@ 2021-06-04 22:13                     ` Kuppuswamy, Sathyanarayanan
  2021-06-04 22:15                     ` Borislav Petkov
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-04 22:13 UTC (permalink / raw)
  To: Tom Lendacky, Borislav Petkov
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel



On 6/4/21 3:01 PM, Tom Lendacky wrote:
>>   	 */
>> -	if (sme_active())
>> +	if (protected_guest_has(VM_HOST_MEM_ENCRYPT))
>>   		swiotlb = 1;
> I still feel this is confusing. SME is a host/bare-metal technology, so
> calling protected_guest_has() seems odd and using VM_HOST_MEM_ENCRYPT,
> where I assume VM is short for virtual machine, also seems odd.
> 
> How about just protected_os_has()? Then you could have
> - HOST_MEM_ENCRYPT  for host memory encryption
> - GUEST_MEM_ENCRYPT for guest memory encryption
> - MEM_ENCRYPT       for either host or guest memory encryption.
> 
> The first is analogous to sme_active(), the second to sev_active() and the
> third to mem_encrypt_active(). Just my opinion, though...
> 

I am not sure whether OS makes sense here. But I am fine with it if
it is the maintainer's choice.

Another option could be protected_boot_has()?

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn
  2021-06-04 22:01                   ` Tom Lendacky
  2021-06-04 22:13                     ` Kuppuswamy, Sathyanarayanan
@ 2021-06-04 22:15                     ` Borislav Petkov
  2021-06-04 23:31                       ` Tom Lendacky
  1 sibling, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-06-04 22:15 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

On Fri, Jun 04, 2021 at 05:01:31PM -0500, Tom Lendacky wrote:
> The first is analogous to sme_active(), the second to sev_active() and the
> third to mem_encrypt_active(). Just my opinion, though...

Yeah, or cc_has() where "cc" means "confidential computing". Or "coco"...

Yeah, no good idea yet.

> I don't think you want a WARN_ON_ONCE() here. The code will be written to
> work with either SEV or TDX, so we shouldn't warn on a check for a TDX
> supported feature when running on AMD (or vice-versa).

That's an AMD-specific path so it would warn only when a flag is used
which is unknown/unused yet on AMD.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn
  2021-06-04 22:15                     ` Borislav Petkov
@ 2021-06-04 23:31                       ` Tom Lendacky
  2021-06-05 11:03                         ` Borislav Petkov
  0 siblings, 1 reply; 381+ messages in thread
From: Tom Lendacky @ 2021-06-04 23:31 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

On 6/4/21 5:15 PM, Borislav Petkov wrote:
> On Fri, Jun 04, 2021 at 05:01:31PM -0500, Tom Lendacky wrote:
>> The first is analogous to sme_active(), the second to sev_active() and the
>> third to mem_encrypt_active(). Just my opinion, though...
> 
> Yeah, or cc_has() where "cc" means "confidential computing". Or "coco"...
> 
> Yeah, no good idea yet.
> 
>> I don't think you want a WARN_ON_ONCE() here. The code will be written to
>> work with either SEV or TDX, so we shouldn't warn on a check for a TDX
>> supported feature when running on AMD (or vice-versa).
> 
> That's an AMD-specific path so it would warn only when a flag is used
> which is unknown/unused yet on AMD.

But the check can happen on Intel or AMD. We have lots of checks for
sme_active() in common code that are executed on Intel today, but they
just return false. It's the same principle, you don't want to WARN on
those, just return false. E.g.:

	/* some common code path */
	if (cc_has(XYZ))
		do_y();

If Intel has XYZ but AMD does not, you don't want to WARN, just return false.

Thanks,
Tom

> 

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v3 1/1] x86/tdx: Ignore WBINVD instruction for TDX guest
  2021-05-27  4:38                                               ` [RFC v2-fix-v3 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-06-05  3:35                                                 ` Dan Williams
  2021-06-08 21:35                                                   ` [RFC v2-fix-v3 1/1] x86/tdx: Skip " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-06-05  3:35 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Wed, May 26, 2021 at 9:38 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> Functionally only devices outside the CPU (such as DMA devices,
> or persistent memory for flushing) can notice the external side
> effects from WBINVD's cache flushing for write back mappings. One
> exception here is MKTME, but that is not visible outside the TDX
> module and not possible inside a TDX guest.
>
> Currently TDX does not support DMA, because DMA typically needs
> uncached access for MMIO, and the current TDX module always sets
> the IgnorePAT bit, which prevents that.
>
> Persistent memory is also currently not supported. There are some
> other cases that use WBINVD, such as the legacy ACPI sleeps, but
> these are all not supported in virtualization and there are better
> mechanisms inside a guest anyways. The guests usually are not
> aware of power management. Another code path that uses WBINVD is
> the MTRR driver, but EPT/virtualization always disables MTRRs so
> those are not needed. This all implies WBINVD is not needed with
> current TDX.
>
> So handle the WBINVD instruction as a nop. Currently, the #VE exception
> handler does not include any warning for WBINVD handling because
> ACPI reboot code uses it. This is the same behavior as KVM. It
> only allows WBINVD in a guest when the guest supports VT-d (=DMA),
> but just handles it as a nop if it doesn't.
>
> If TDX ever gets DMA support, or persistent memory support, or
> some other devices that can observe flushing side effects, a
> hypercall can be added to implement it similar to AMD-SEV. But
> current TDX does not need it.

Please just drop this patch. It serves no purpose especially when the
assertion is that nothing in TDX will miss WBINVD. Why would Linux
merge a patch that has no claimed end user benefit? If the only known
usage of WBINVD in a TDX guest is the ACPI reboot path then add an
is_protected_guest() to that one usage.

If a TDX guest runs an unexpected WBINVD that's a bug that needs a kernel fix.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v1 2/3] x86/tdx: Handle early IO operations
  2021-05-27  4:23           ` [RFC v2-fix-v1 2/3] x86/tdx: Handle early IO operations Kuppuswamy Sathyanarayanan
@ 2021-06-05  4:26             ` Williams, Dan J
  0 siblings, 0 replies; 381+ messages in thread
From: Williams, Dan J @ 2021-06-05  4:26 UTC (permalink / raw)
  To: sathyanarayanan.kuppuswamy, peterz, Luck, Tony, Hansen, Dave, luto
  Cc: kirill.shutemov, Raj, Ashok, seanjc, ak, knsathya, linux-kernel

On Wed, 2021-05-26 at 21:23 -0700, Kuppuswamy Sathyanarayanan wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Add an early #VE handler to convert early port IOs into TDCALLs.
> 
> TDX cannot do port IO directly. The TDX module triggers a #VE
> exception to let the guest kernel to emulate operations like
> IO ports, by converting them into TDCALLs to call the host.

s,kernel to emulate operations like IO ports,kernel emulate port I/O,

> 
> A fully featured #VE handler support for port IO will be added
> later in this patch set (in patch titled "x86/tdx: Handle port
> I/O). But it can be used only at later point in the boot
> process. So to support port IO in early boot code, add a
> minimal support in early exception handler. This is similar to
> what AMD SEV does.

Clarify "fully featured". I naively thought that the notes below about
trace and printk were implying that the full featured #VE handler will
use printk() so it can't also use #VE since printk() would recurse into
the #VE handler if the serial console is using port IO.

...but that does not seem to be the reason since:

http://lore.kernel.org/r/20210527042356.3983284-4-sathyanarayanan.kuppuswamy@linux.intel.com

...is also using #VE for port IO emulation?

> This is mainly to support early_printk's serial driver, as
> well as potentially the VGA driver (although it is expected
> not to be used).
> 
> The early handler only does IO calls and nothing else, and
> anything that goes wrong results in a normal early exception
> panic.
> 
> It cannot share the code paths with the normal #VE handler
> because it needs to avoid using trace calls or printk.
> 
> This early handler allows us to use the normal in*/out*
> macros without patching them for every driver. We don't
> expect IO port IO to be performance critical at all, so an
> extra #VE exception is no problem.

"There is no expectation that early port IO is performance critical, so
the #VE emulation cost is worth the simplicity benefit of not patching
out port IO usage in early code."

> There are also no concerns
> with nesting, since there should be no NMIs this early.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

Missing a Sathya signed-off-by.

> ---
>  arch/x86/include/asm/tdx.h |  6 ++++
>  arch/x86/kernel/head64.c   |  4 +++
>  arch/x86/kernel/tdx.c      | 59 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 69 insertions(+)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 53f844200909..e880a9dd40d3 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -72,6 +72,7 @@ u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
>                     struct tdx_hypercall_output *out);
>  
>  bool tdx_protected_guest_has(unsigned long flag);
> +bool tdg_early_handle_ve(struct pt_regs *regs);
>  
>  #else // !CONFIG_INTEL_TDX_GUEST
>  
> @@ -87,6 +88,11 @@ static inline bool tdx_protected_guest_has(unsigned long flag)
>         return false;
>  }
>  
> +static inline bool tdg_early_handle_ve(struct pt_regs *regs)
> +{
> +       return false;
> +}
> +
>  #endif /* CONFIG_INTEL_TDX_GUEST */
>  
>  #ifdef CONFIG_INTEL_TDX_GUEST_KVM
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index 75f2401cb5db..23d1ff4626aa 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -410,6 +410,10 @@ void __init do_early_exception(struct pt_regs *regs, int trapnr)
>             trapnr == X86_TRAP_VC && handle_vc_boot_ghcb(regs))
>                 return;
>  
> +       if (IS_ENABLED(CONFIG_INTEL_TDX_GUEST) &&

This explicit IS_ENABLED() is unnecessary given tdg_early_handle_ve()
returns false in the CONFIG_INTEL_TDX_GUEST=n case as defined above.

> +           trapnr == X86_TRAP_VE && tdg_early_handle_ve(regs))
> +               return;
> +
>         early_fixup_exception(regs, trapnr);
>  }
>  
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 858e7f3d8f36..ca3442b7accf 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -13,6 +13,10 @@
>  #define TDINFO                         1
>  #define TDGETVEINFO                    3
>  
> +#define VE_GET_IO_TYPE(exit_qual)      (((exit_qual) & 8) ? 0 : 1)

How about VE_IS_IO_OUT()? To match its usage as a flag below...

> +#define VE_GET_IO_SIZE(exit_qual)      (((exit_qual) & 7) + 1)
> +#define VE_GET_PORT_NUM(exit_qual)     ((exit_qual) >> 16)
> +
>  static struct {
>         unsigned int gpa_width;
>         unsigned long attributes;
> @@ -256,6 +260,61 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>         return ret;
>  }
>  
> +/*
> + * Handle early IO, mainly for early printks serial output.
> + * This avoids anything that doesn't work early on, like tracing
> + * or printks, by calling the low level functions directly. Any
> + * problems are handled by falling back to a standard early exception.
> + *
> + * Assumes the IO instruction was using ax, which is enforced
> + * by the standard io.h macros.
> + */
> +static __init bool tdx_early_io(struct ve_info *ve, struct pt_regs *regs)
> +{
> +       struct tdx_hypercall_output outh;
> +       int out = VE_GET_IO_TYPE(ve->exit_qual);
> +       int size = VE_GET_IO_SIZE(ve->exit_qual);
> +       int port = VE_GET_PORT_NUM(ve->exit_qual);
> +       int ret;
> +

...and if @out is a flag then the below can be simplified to:

        ret = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, out, port,
                              regs->ax, &outh);
        if (!out && !ret) {
                u64 mask = GENMASK(8 * size, 0);

                regs->ax &= ~mask;
                regs->ax |= outh.r11 & mask;
        }

        return !ret;


With the above fixups and clarifications you can add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

...but the discussion here about fully featured leaves me confused
about the approach taken in the next patch.

> +       if (out) {
> +               ret = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION,
> +                                     size, 1, port,
> +                                     regs->ax,
> +                                     &outh);
> +       } else {
> +               u64 mask = GENMASK(8 * size, 0);
> +
> +               ret = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION,
> +                                     size, 0, port,
> +                                     regs->ax, &outh);
> +               if (!ret) {
> +                       regs->ax &= ~mask;
> +                       regs->ax |= outh.r11 & mask;
> +               }
> +       }
> +
> +       return !ret;
> +}
> +
> +/*
> + * Early #VE exception handler. Just used to handle port IOs
> + * for early_printk. If anything goes wrong handle it like
> + * a normal early exception.
> + */
> +__init bool tdg_early_handle_ve(struct pt_regs *regs)
> +{
> +       struct ve_info ve;
> +
> +       if (tdg_get_ve_info(&ve))
> +               return false;
> +
> +       if (ve.exit_reason == EXIT_REASON_IO_INSTRUCTION)
> +               return tdx_early_io(&ve, regs);
> +
> +       return false;
> +}
> +
>  void __init tdx_early_init(void)
>  {
>         if (!cpuid_has_tdx_guest())


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn
  2021-06-04 23:31                       ` Tom Lendacky
@ 2021-06-05 11:03                         ` Borislav Petkov
  2021-06-05 18:12                           ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-06-05 11:03 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

On Fri, Jun 04, 2021 at 06:31:03PM -0500, Tom Lendacky wrote:
> If Intel has XYZ but AMD does not, you don't want to WARN, just return false.

Aha, *now*, I see what you mean. Ok, so the reason why I added the
WARN is to sanity-check whether we're handling all possible VM_* or
PROT_GUEST_* flags properly and whether we're missing some. As a
debugging help. It'll get removed before applying I guess.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn
  2021-06-05 11:03                         ` Borislav Petkov
@ 2021-06-05 18:12                           ` Kuppuswamy, Sathyanarayanan
  2021-06-05 20:08                             ` Borislav Petkov
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-05 18:12 UTC (permalink / raw)
  To: Borislav Petkov, Tom Lendacky
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel



On 6/5/21 4:03 AM, Borislav Petkov wrote:
> Aha,*now*, I see what you mean. Ok, so the reason why I added the
> WARN is to sanity-check whether we're handling all possible VM_* or
> PROT_GUEST_* flags properly and whether we're missing some. As a
> debugging help. It'll get removed before applying I guess.

Borislav/Tom,

Any consensus on function name and flag prefix?

Currently suggested function names are,

cc_has() or protected_guest_has() or prot_guest_has() or protected_boot_has()

For flag prefix either PR_GUEST_* or CC_*

I am planning to submit another version of this patch with suggested fixes.
If we could reach some consensus on function and flag names, I can include
them in it. If not, I will submit next version without any renames.

Please let me know your comments.

BTW, my choice is protected_guest_has() or CC_has().

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v1 3/3] x86/tdx: Handle port I/O
  2021-05-27  4:23           ` [RFC v2-fix-v1 3/3] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
@ 2021-06-05 18:52             ` Dan Williams
  2021-06-05 20:08               ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-06-05 18:52 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Wed, May 26, 2021 at 9:24 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> TDX hypervisors cannot emulate instructions directly. This
> includes port IO which is normally emulated in the hypervisor.
> All port IO instructions inside TDX trigger the #VE exception
> in the guest and would be normally emulated there.
>
> For the really early code in the decompressor, #VE cannot be
> used because the IDT needed for handling the exception is not
> set-up, and some other infrastructure needed by the handler
> is missing. So to support port IO in decompressor code, add
> support for paravirt based I/O port virtualization.
>
> Also string I/O is not supported in TDX guest. So, unroll the
> string I/O operation into a loop operating on one element at
> a time. This method is similar to AMD SEV, so just extend the
> support for TDX guest platform.

Given early port IO is broken out into its own previous patch, I think
it makes sense to break out the decompressor port IO enabling from the
final runtime port IO support.

The argument in the previous patch about using #VE emulation in the
early code was collisions with trace and printk support in the "fully
featured" #VE handler later in the series. My interpretation of that
collision was due to the possibility of the #VE handler going into
infinite recursion if a printk in the handler triggered port IO. It
seems I do not have the right picture of the constraints. Given the
runtime kernel can direct replace in/out macros I would expect a
statement of the tradeoff with #VE emulation and why the post
decompressor code is still using emulation.

>
> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/boot/compressed/Makefile |  1 +
>  arch/x86/boot/compressed/tdcall.S |  3 ++
>  arch/x86/boot/compressed/tdx.c    | 28 ++++++++++++++++++
>  arch/x86/include/asm/io.h         |  7 +++--
>  arch/x86/include/asm/tdx.h        | 47 ++++++++++++++++++++++++++++++-
>  arch/x86/kernel/tdx.c             | 39 +++++++++++++++++++++++++
>  6 files changed, 122 insertions(+), 3 deletions(-)
>  create mode 100644 arch/x86/boot/compressed/tdcall.S
>
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index a2554621cefe..a944a2038797 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -97,6 +97,7 @@ endif
>
>  vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
>  vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
> +vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o
>
>  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
>  efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
> diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
> new file mode 100644
> index 000000000000..aafadc136c88
> --- /dev/null
> +++ b/arch/x86/boot/compressed/tdcall.S
> @@ -0,0 +1,3 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include "../../kernel/tdcall.S"
> diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
> index 0a87c1775b67..cb20962c7da6 100644
> --- a/arch/x86/boot/compressed/tdx.c
> +++ b/arch/x86/boot/compressed/tdx.c
> @@ -4,6 +4,8 @@
>   */
>
>  #include <asm/tdx.h>
> +#include <asm/vmx.h>
> +#include <vdso/limits.h>
>
>  static int __ro_after_init tdx_guest = -1;
>
> @@ -30,3 +32,29 @@ bool is_tdx_guest(void)
>         return !!tdx_guest;
>  }
>
> +/*
> + * Helper function used for making hypercall for "out"
> + * instruction. It will be called from __out IO
> + * macro (in tdx.h).
> + */
> +void tdg_out(int size, int port, unsigned int value)
> +{
> +       __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 1,
> +                       port, value, NULL);
> +}
> +
> +/*
> + * Helper function used for making hypercall for "in"
> + * instruction. It will be called from __in IO macro
> + * (in tdx.h). If IO is failed, it will return all 1s.
> + */
> +unsigned int tdg_in(int size, int port)
> +{
> +       struct tdx_hypercall_output out = {0};
> +       int err;
> +
> +       err = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 0,
> +                             port, 0, &out);
> +
> +       return err ? UINT_MAX : out.r11;
> +}

The previous patch open coded tdg_{in,out} and this one provides
helpers. I think at a minimum they should be consistent and pick one
style.

> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> index ef7a686a55a9..daa75c8eef5d 100644
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -40,6 +40,7 @@
>
>  #include <linux/string.h>
>  #include <linux/compiler.h>
> +#include <linux/protected_guest.h>
>  #include <asm/page.h>
>  #include <asm/early_ioremap.h>
>  #include <asm/pgtable_types.h>
> @@ -309,7 +310,8 @@ static inline unsigned type in##bwl##_p(int port)                   \
>                                                                         \
>  static inline void outs##bwl(int port, const void *addr, unsigned long count) \
>  {                                                                      \
> -       if (sev_key_active()) {                                         \
> +       if (sev_key_active() ||                                         \
> +           protected_guest_has(VM_UNROLL_STRING_IO)) {                 \
>                 unsigned type *value = (unsigned type *)addr;           \
>                 while (count) {                                         \
>                         out##bwl(*value, port);                         \
> @@ -325,7 +327,8 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
>                                                                         \
>  static inline void ins##bwl(int port, void *addr, unsigned long count) \
>  {                                                                      \
> -       if (sev_key_active()) {                                         \
> +       if (sev_key_active() ||                                         \
> +           protected_guest_has(VM_UNROLL_STRING_IO)) {                 \
>                 unsigned type *value = (unsigned type *)addr;           \
>                 while (count) {                                         \
>                         *value = in##bwl(port);                         \
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index e880a9dd40d3..6ba2dcea533f 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -5,6 +5,8 @@
>
>  #define TDX_CPUID_LEAF_ID      0x21
>
> +#ifndef __ASSEMBLY__
> +
>  #ifdef CONFIG_INTEL_TDX_GUEST
>
>  #include <asm/cpufeature.h>
> @@ -74,6 +76,48 @@ u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
>  bool tdx_protected_guest_has(unsigned long flag);
>  bool tdg_early_handle_ve(struct pt_regs *regs);
>
> +void tdg_out(int size, int port, unsigned int value);
> +unsigned int tdg_in(int size, int port);
> +
> +/* Helper function for converting {b,w,l} to byte size */
> +static inline int tdx_get_iosize(char *str)
> +{
> +       if (str[0] == 'w')
> +               return 2;
> +       else if (str[0] == 'l')
> +               return 4;
> +
> +       return 1;
> +}

This seems like an unnecessary novelty. The BUILDIO() macro in
arch/x86/include/asm/io.h takes a type argument, why can't the size be
explicitly specified rather than inferred from string parsing?

> +
> +/*
> + * To support I/O port access in decompressor or early kernel init
> + * code, since #VE exception handler cannot be used, use paravirt
> + * model to implement __in/__out macros which will in turn be used
> + * by in{b,w,l}()/out{b,w,l} I/O helper macros used in kernel. You
> + * can find the __in/__out macro usage in arch/x86/include/asm/io.h
> + */
> +#ifdef BOOT_COMPRESSED_MISC_H
> +#define __out(bwl, bw)                                                 \
> +do {                                                                   \
> +       if (is_tdx_guest()) {                                           \
> +               tdg_out(tdx_get_iosize(#bwl), port, value);             \
> +       } else {                                                        \
> +               asm volatile("out" #bwl " %" #bw "0, %w1" : :           \
> +                               "a"(value), "Nd"(port));                \
> +       }                                                               \
> +} while (0)
> +#define __in(bwl, bw)                                                  \
> +do {                                                                   \
> +       if (is_tdx_guest()) {                                           \
> +               value = tdg_in(tdx_get_iosize(#bwl), port);             \
> +       } else {                                                        \
> +               asm volatile("in" #bwl " %w1, %" #bw "0" :              \
> +                               "=a"(value) : "Nd"(port));              \
> +       }                                                               \
> +} while (0)
> +#endif
> +
>  #else // !CONFIG_INTEL_TDX_GUEST
>
>  static inline bool is_tdx_guest(void)
> @@ -161,6 +205,7 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
>  {
>         return -ENODEV;
>  }
> -#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
>
> +#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
> +#endif /* __ASSEMBLY__ */
>  #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index ca3442b7accf..4a84487ee8ff 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -202,6 +202,42 @@ static void tdg_handle_cpuid(struct pt_regs *regs)
>         regs->dx = out.r15;
>  }
>
> +void tdg_out(int size, int port, unsigned int value)
> +{
> +       tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 1, port, value);
> +}
> +
> +unsigned int tdg_in(int size, int port)
> +{
> +       struct tdx_hypercall_output out = {0};
> +       u64 err;
> +
> +       err = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 0,
> +                             port, 0, &out);
> +
> +       return err ? UINT_MAX : out.r11;
> +}
> +
> +static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
> +{
> +       bool string = exit_qual & 16;
> +       int out, size, port;
> +
> +       /* I/O strings ops are unrolled at build time. */
> +       BUG_ON(string);

...and here is why I think the WBINVD patch is bogus, or at least
inconsistent with the decision taken here. If it's ok to BUG_ON
instructions that "can't happen" due to care taken to ensure build
time guarantees then it is ok to skip WBINVD handling with the same
care taken to prevent its usage at build time.

> +
> +       out = VE_GET_IO_TYPE(exit_qual);
> +       size = VE_GET_IO_SIZE(exit_qual);
> +       port = VE_GET_PORT_NUM(exit_qual);
> +
> +       if (out) {
> +               tdg_out(size, port, regs->ax);
> +       } else {
> +               regs->ax &= ~GENMASK(8 * size, 0);
> +               regs->ax |= tdg_in(size, port) & GENMASK(8 * size, 0);
> +       }
> +}
> +
>  unsigned long tdg_get_ve_info(struct ve_info *ve)
>  {
>         u64 ret;
> @@ -248,6 +284,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>         case EXIT_REASON_CPUID:
>                 tdg_handle_cpuid(regs);
>                 break;
> +       case EXIT_REASON_IO_INSTRUCTION:
> +               tdg_handle_io(regs, ve->exit_qual);
> +               break;
>         default:
>                 pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>                 return -EFAULT;
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v1 3/3] x86/tdx: Handle port I/O
  2021-06-05 18:52             ` Dan Williams
@ 2021-06-05 20:08               ` Kuppuswamy, Sathyanarayanan
  2021-06-05 21:08                 ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-05 20:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List



On 6/5/21 11:52 AM, Dan Williams wrote:
> On Wed, May 26, 2021 at 9:24 PM Kuppuswamy Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>>
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> TDX hypervisors cannot emulate instructions directly. This
>> includes port IO which is normally emulated in the hypervisor.
>> All port IO instructions inside TDX trigger the #VE exception
>> in the guest and would be normally emulated there.
>>
>> For the really early code in the decompressor, #VE cannot be
>> used because the IDT needed for handling the exception is not
>> set-up, and some other infrastructure needed by the handler
>> is missing. So to support port IO in decompressor code, add
>> support for paravirt based I/O port virtualization.
>>
>> Also string I/O is not supported in TDX guest. So, unroll the
>> string I/O operation into a loop operating on one element at
>> a time. This method is similar to AMD SEV, so just extend the
>> support for TDX guest platform.
> 
> Given early port IO is broken out in its own previous I think it makes
> sense to break out the decompressor port IO enabling from final
> runtime port IO support.

The patch titled "x86/tdx: Handle early IO operations" mainly adds
IO #VE support to the early exception handler. The decompression code
IO support does not depend on it. Do you still think it is better to
move it into that patch?

> 
> The argument in the previous patch about using #VE emulation in the
> early code was collisions with trace and printk support in the "fully
> featured" #VE handler later in the series. My interpretation of that
> collision was due to the possibility of the #VE handler going into
> infinite recursion if a printk in the handler triggered port IO. It

No. AFAIK, it has nothing to do with infinite recursion. We are just
highlighting the fact that when the kernel uses the early exception
handler, we cannot use a code path that enables tracing support. So we
use the simplest way to trigger IO hypercalls.

if (early #VE exception path)
     handle_io_ve()
         __tdx_hypercall

if (normal #VE path)
     handle_io_ve()
         __tdx_hypercall (current version)
	// Later on when adding tracing support, we will replace it
	// with trace hypercalls.
	__trace_tdx_hypercall

As you can see in the above design flow, later on when adding tracing
support we will have to split the early IO #VE handling code from the
normal IO handling code. So instead of using common code now and
refactoring it later, we just use different code paths for both
of them.

> seems I do not have the right picture of the constraints. Given the
> runtime kernel can direct replace in/out macros I would expect a
> statement of the tradeoff with #VE emulation and why the post
> decompressor code is still using emulation.

Currently the decompression code cannot use #VE based IO emulation,
because it does not know how to handle #VE exceptions. Also, it is much
easier to replace IO calls with TDX hypercalls in the decompression code
than to teach the decompression code how to handle #VE exceptions.

> 
>>
>> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>> ---
>>   arch/x86/boot/compressed/Makefile |  1 +
>>   arch/x86/boot/compressed/tdcall.S |  3 ++
>>   arch/x86/boot/compressed/tdx.c    | 28 ++++++++++++++++++
>>   arch/x86/include/asm/io.h         |  7 +++--
>>   arch/x86/include/asm/tdx.h        | 47 ++++++++++++++++++++++++++++++-
>>   arch/x86/kernel/tdx.c             | 39 +++++++++++++++++++++++++
>>   6 files changed, 122 insertions(+), 3 deletions(-)
>>   create mode 100644 arch/x86/boot/compressed/tdcall.S
>>
>> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
>> index a2554621cefe..a944a2038797 100644
>> --- a/arch/x86/boot/compressed/Makefile
>> +++ b/arch/x86/boot/compressed/Makefile
>> @@ -97,6 +97,7 @@ endif
>>

>>   static int __ro_after_init tdx_guest = -1;
>>
>> @@ -30,3 +32,29 @@ bool is_tdx_guest(void)
>>          return !!tdx_guest;
>>   }
>>
>> +/*
>> + * Helper function used for making hypercall for "out"
>> + * instruction. It will be called from __out IO
>> + * macro (in tdx.h).
>> + */
>> +void tdg_out(int size, int port, unsigned int value)
>> +{
>> +       __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 1,
>> +                       port, value, NULL);
>> +}
>> +
>> +/*
>> + * Helper function used for making hypercall for "in"
>> + * instruction. It will be called from __in IO macro
>> + * (in tdx.h). If IO is failed, it will return all 1s.
>> + */
>> +unsigned int tdg_in(int size, int port)
>> +{
>> +       struct tdx_hypercall_output out = {0};
>> +       int err;
>> +
>> +       err = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 0,
>> +                             port, 0, &out);
>> +
>> +       return err ? UINT_MAX : out.r11;
>> +}
> 
> The previous patch open coded tdg_{in,out} and this one provides
> helpers. I think at a minimum they should be consistent and pick one
> style.

As I have mentioned above, the early IO #VE handler is a special case.
We don't want to complicate its code path with debug or tracing support.
So it is not a good comparison target.

In this case, the reason for adding a helper function is to make it
easier to call it from tdx.h.

> 
>> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
>> index ef7a686a55a9..daa75c8eef5d 100644
>> --- a/arch/x86/include/asm/io.h
>> +++ b/arch/x86/include/asm/io.h
>> @@ -40,6 +40,7 @@
>>

snip

>> +
>> +/* Helper function for converting {b,w,l} to byte size */
>> +static inline int tdx_get_iosize(char *str)
>> +{
>> +       if (str[0] == 'w')
>> +               return 2;
>> +       else if (str[0] == 'l')
>> +               return 4;
>> +
>> +       return 1;
>> +}
> 
> This seems like an unnecessary novelty. The BUILDIO() macro in
> arch/x86/include/asm/io.h takes a type argument, why can't the size be
> explicitly specified rather than inferred from string parsing?

I don't want to make changes to the generic macros in io.h if it can be
avoided. They follow a similar argument/type pattern in all arch/* code.
Also, it is easier to handle TDX as a special case here.


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-05 18:12                           ` Kuppuswamy, Sathyanarayanan
@ 2021-06-05 20:08                             ` Borislav Petkov
  0 siblings, 0 replies; 381+ messages in thread
From: Borislav Petkov @ 2021-06-05 20:08 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Tom Lendacky, Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

On Sat, Jun 05, 2021 at 11:12:57AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> cc_has() or protected_guest_has() or prot_guest_has() or protected_boot_has()

Even if I still think it is not optimal, prot_guest_has() seems to be
best what we have because protected_guest_has() together with the flag
will become just too long to scan at a quick glance. And if you have to
do two tests, you'd have to break the line.

> For flag prefix either PR_GUEST_* or CC_*

PR_GUEST_* sounds ok to me.

The "cc" prefix stuff is nice and short but it doesn't say what it means
because it is simply too short. And code readability is very important.

I'd say.

Still open for better suggestions though.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v1 3/3] x86/tdx: Handle port I/O
  2021-06-05 20:08               ` Kuppuswamy, Sathyanarayanan
@ 2021-06-05 21:08                 ` Dan Williams
  2021-06-07 16:24                   ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-06-05 21:08 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Sat, Jun 5, 2021 at 1:08 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
>
>
> On 6/5/21 11:52 AM, Dan Williams wrote:
> > On Wed, May 26, 2021 at 9:24 PM Kuppuswamy Sathyanarayanan
> > <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> >>
> >> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >>
> >> TDX hypervisors cannot emulate instructions directly. This
> >> includes port IO which is normally emulated in the hypervisor.
> >> All port IO instructions inside TDX trigger the #VE exception
> >> in the guest and would be normally emulated there.
> >>
> >> For the really early code in the decompressor, #VE cannot be
> >> used because the IDT needed for handling the exception is not
> >> set-up, and some other infrastructure needed by the handler
> >> is missing. So to support port IO in decompressor code, add
> >> support for paravirt based I/O port virtualization.
> >>
> >> Also string I/O is not supported in TDX guest. So, unroll the
> >> string I/O operation into a loop operating on one element at
> >> a time. This method is similar to AMD SEV, so just extend the
> >> support for TDX guest platform.
> >
> > Given early port IO is broken out in its own previous I think it makes
> > sense to break out the decompressor port IO enabling from final
> > runtime port IO support.
>
> Patch titled "x86/tdx: Handle early IO operations" mainly adds
> IO #VE support in early exception handler. Decompression code IO
> support does not have dependency on it. You still think it is
> better to move it that patch?
>

No, I was suggesting three patches instead of 2:

early
decompressor
final-runtime

> >
> > The argument in the previous patch about using #VE emulation in the
> > early code was collisions with trace and printk support in the "fully
> > featured" #VE handler later in the series. My interpretation of that
> > collision was due to the possibility of the #VE handler going into
> > infinite recursion if a printk in the handler triggered port IO. It
>
> No. AFAIK, It has nothing to do with infinite recursion. We are just
> highlighting the fact that when kernel uses early exception handler
> support, we cannot use code path that enables tracing support. So we
> use simplest way to trigger IO hypercalls.

Ok, then how does this approach handle printk from the #VE handler if
printk issues port IO?

>
> if (early #VE exception path)
>      handle_io_ve()
>          __tdx_hypercall
>
> if (normal #VE path)
>      handle_io_ve()
>          __tdx_hypercall (current version)
>         // Later on when adding tracing support, we will replace it
>         // with trace hypercalls.
>         __trace_tdx_hypercall
>
> As you can see in above design flow, later on when adding tracing
> support we will have split the early #IO handling code from
> normal IO handling code. So instead of using common code now and
> refactor it later on, we just use different code path for both
> of them.

Could you put that in the changelog, it was non-obvious to me.

> > seems I do not have the right picture of the constraints. Given the
> > runtime kernel can direct replace in/out macros I would expect a
> > statement of the tradeoff with #VE emulation and why the post
> > decompressor code is still using emulation.
>
> Currently decompression code cannot use #VE based IO emulation. It does
> not know how to handle #VE exceptions. Also, It is much easier to replace
> IO calls with TDX hypercalls in decompression code when compared with
> teaching how to handle #VE exceptions in decompression code.

Ok, but that does not answer the background behind the decision to use
emulation rather than direct replacement of port IO instructions in
the final kernel runtime image.

This patch mixes those 2 concerns and I think it deserves to be broken
out and explained.

>
> >
> >>
> >> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> >> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> >> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> >> ---
> >>   arch/x86/boot/compressed/Makefile |  1 +
> >>   arch/x86/boot/compressed/tdcall.S |  3 ++
> >>   arch/x86/boot/compressed/tdx.c    | 28 ++++++++++++++++++
> >>   arch/x86/include/asm/io.h         |  7 +++--
> >>   arch/x86/include/asm/tdx.h        | 47 ++++++++++++++++++++++++++++++-
> >>   arch/x86/kernel/tdx.c             | 39 +++++++++++++++++++++++++
> >>   6 files changed, 122 insertions(+), 3 deletions(-)
> >>   create mode 100644 arch/x86/boot/compressed/tdcall.S
> >>
> >> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> >> index a2554621cefe..a944a2038797 100644
> >> --- a/arch/x86/boot/compressed/Makefile
> >> +++ b/arch/x86/boot/compressed/Makefile
> >> @@ -97,6 +97,7 @@ endif
> >>
>
> >>   static int __ro_after_init tdx_guest = -1;
> >>
> >> @@ -30,3 +32,29 @@ bool is_tdx_guest(void)
> >>          return !!tdx_guest;
> >>   }
> >>
> >> +/*
> >> + * Helper function used for making hypercall for "out"
> >> + * instruction. It will be called from __out IO
> >> + * macro (in tdx.h).
> >> + */
> >> +void tdg_out(int size, int port, unsigned int value)
> >> +{
> >> +       __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 1,
> >> +                       port, value, NULL);
> >> +}
> >> +
> >> +/*
> >> + * Helper function used for making hypercall for "in"
> >> + * instruction. It will be called from __in IO macro
> >> + * (in tdx.h). If IO is failed, it will return all 1s.
> >> + */
> >> +unsigned int tdg_in(int size, int port)
> >> +{
> >> +       struct tdx_hypercall_output out = {0};
> >> +       int err;
> >> +
> >> +       err = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 0,
> >> +                             port, 0, &out);
> >> +
> >> +       return err ? UINT_MAX : out.r11;
> >> +}
> >
> > The previous patch open coded tdg_{in,out} and this one provides
> > helpers. I think at a minimum they should be consistent and pick one
> > style.
>
> As I have mentioned above, early IO #VE handler is a special case. we
> don't want to complicate its code path with debug or tracing support.
> So it is not a good comparison target.

This patch and the last do the same thing in 2 different ways. One of
them should match the other even if the helpers are not directly
reused.

> In this case, the reason for adding helper function is to make it easier
> for calling it from tdx.h.
> >
> >> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> >> index ef7a686a55a9..daa75c8eef5d 100644
> >> --- a/arch/x86/include/asm/io.h
> >> +++ b/arch/x86/include/asm/io.h
> >> @@ -40,6 +40,7 @@
> >>
>
> snip
>
> >> +
> >> +/* Helper function for converting {b,w,l} to byte size */
> >> +static inline int tdx_get_iosize(char *str)
> >> +{
> >> +       if (str[0] == 'w')
> >> +               return 2;
> >> +       else if (str[0] == 'l')
> >> +               return 4;
> >> +
> >> +       return 1;
> >> +}
> >
> > This seems like an unnecessary novelty. The BUILDIO() macro in
> > arch/x86/include/asm/io.h takes a type argument, why can't the size be
> > explicitly specified rather than inferred from string parsing?
>
> I don't want to make changes to generic macros in io.h if it can be
> avoided. It follows similar argument/type in all arch/* code. Also, it
> is easier to handle TDX as a special case here.
>

What changes are you talking about to the generic macros? The BUILDIO
macro passes in a size parameter explicitly rather than inferring the
size from the string name of an argument. BUILDIO does not need to
change, it's backend just needs to do the right thing in the TDX case.

Otherwise, "I don't want to" is not a sufficient justification for
avoiding needlessly new design patterns.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/2] x86/sev-es: Abstract out MMIO instruction decoding
  2021-06-02 19:42       ` [RFC v2-fix-v2 1/2] x86/sev-es: Abstract out MMIO instruction decoding Kuppuswamy Sathyanarayanan
@ 2021-06-05 21:56         ` Dan Williams
  2021-06-08 15:59           ` [RFC v2-fix-v3 0/4] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-06-05 21:56 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Wed, Jun 2, 2021 at 12:42 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>

Lead in with the what because I read 2 paragraphs to figure out that
this was a prep patch.

"In preparation for sharing MMIO instruction decode between SEV-ES and
TDX factor out the common decode into a new insn_decode_mmio()
helper."

> For regular virtual machine, MMIO is handled by the VMM: KVM
> emulates instruction that caused MMIO. But, this model doesn't
> work for a secure VMs (like SEV or TDX) as VMM doesn't have
> access to the guest memory and register state. VMM needs
> assistance in handling MMIO: it induces exception in the guest.
> Guest has to decode the instruction and handle it on its own.
>
> Instruction decoding logic is similar between AMD SEV and TDX
> code. So extract the decoding code to insn-eval.c where it can
> be used by both SEV and TDX.
>
> This code adds no functional changes. It is only build-tested
> for SEV.

The diff is such that I could not verify "no functional change" change
without doing more careful analysis. Typically with non-trivial
refactoring they are split out over a few patches with a final removal
of replaced infra at the end. This does the entire conversion all at
once.

How about an approach that has vc_handle_mmio() handle
MMIO_DECODE_FAILED for missing support in the common helper until the
final patch that can do:

> +       mmio = insn_decode_mmio(insn, &bytes);
> +       if (mmio == MMIO_DECODE_FAILED)
> +               return ES_DECODE_FAILED;

...i.e. insn_decode_mmio() is finally prepared to handle all scenarios
and vc_handle_mmio_twobyte_ops() can finally be deleted. This also
helps a future bisect that finds "whoops, 'no functional changes' was
incorrect".

>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Joerg Roedel <jroedel@suse.de>

Missing Sathya signed-off-by...

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v1 3/3] x86/tdx: Handle port I/O
  2021-06-05 21:08                 ` Dan Williams
@ 2021-06-07 16:24                   ` Kuppuswamy, Sathyanarayanan
  2021-06-07 17:17                     ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-07 16:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List



On 6/5/21 2:08 PM, Dan Williams wrote:
> On Sat, Jun 5, 2021 at 1:08 PM Kuppuswamy, Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>>
>>
>>
>> On 6/5/21 11:52 AM, Dan Williams wrote:
>>> On Wed, May 26, 2021 at 9:24 PM Kuppuswamy Sathyanarayanan
>>> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>>>>
>>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>>
>>>> TDX hypervisors cannot emulate instructions directly. This
>>>> includes port IO which is normally emulated in the hypervisor.
>>>> All port IO instructions inside TDX trigger the #VE exception
>>>> in the guest and would be normally emulated there.
>>>>
>>>> For the really early code in the decompressor, #VE cannot be
>>>> used because the IDT needed for handling the exception is not
>>>> set-up, and some other infrastructure needed by the handler
>>>> is missing. So to support port IO in decompressor code, add
>>>> support for paravirt based I/O port virtualization.
>>>>
>>>> Also string I/O is not supported in TDX guest. So, unroll the
>>>> string I/O operation into a loop operating on one element at
>>>> a time. This method is similar to AMD SEV, so just extend the
>>>> support for TDX guest platform.
>>>
>>> Given early port IO is broken out in its own previous I think it makes
>>> sense to break out the decompressor port IO enabling from final
>>> runtime port IO support.
>>
>> Patch titled "x86/tdx: Handle early IO operations" mainly adds
>> IO #VE support in early exception handler. Decompression code IO
>> support does not have dependency on it. You still think it is
>> better to move it that patch?
>>
> 
> No, I was suggesting three patches instead of 2:

Ok. I will move it into a separate patch.

> 

snip

>>
>> Currently decompression code cannot use #VE based IO emulation. It does
>> not know how to handle #VE exceptions. Also, It is much easier to replace
>> IO calls with TDX hypercalls in decompression code when compared with
>> teaching how to handle #VE exceptions in decompression code.
> 
> Ok, but that does not answer the background behind the decision to use
> emulation rather than direct replacement of port IO instructions in
> the final kernel runtime image.

The reasons for using #VE based emulation are:

1. It does not require changes to all usages of emulated instructions in
    all the drivers.
2. Directly replacing instructions with TDX hypercalls would lead to a
    bloated image, so we cannot universally adopt this approach.

The reasons for not using the #VE approach for the decompression code are:

1. #VE handler support does not exist in the decompressor code.
2. Adding such support is more complex and inefficient (just for the
    single use case of handling IO instructions).

So we have replaced IO instructions with TDX hypercalls in the
decompression code.

Does that answer your query?

> 
> This patch mixes those 2 concerns and I think it deserves to be broken
> out and explained.
> 



>>>> +/*
>>>> + * Helper function used for making hypercall for "out"
>>>> + * instruction. It will be called from __out IO
>>>> + * macro (in tdx.h).
>>>> + */
>>>> +void tdg_out(int size, int port, unsigned int value)
>>>> +{
>>>> +       __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 1,
>>>> +                       port, value, NULL);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Helper function used for making hypercall for "in"
>>>> + * instruction. It will be called from __in IO macro
>>>> + * (in tdx.h). If IO is failed, it will return all 1s.
>>>> + */
>>>> +unsigned int tdg_in(int size, int port)
>>>> +{
>>>> +       struct tdx_hypercall_output out = {0};
>>>> +       int err;
>>>> +
>>>> +       err = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 0,
>>>> +                             port, 0, &out);
>>>> +
>>>> +       return err ? UINT_MAX : out.r11;
>>>> +}
>>>
>>> The previous patch open coded tdg_{in,out} and this one provides
>>> helpers. I think at a minimum they should be consistent and pick one
>>> style.
>>
>> As I have mentioned above, early IO #VE handler is a special case. we
>> don't want to complicate its code path with debug or tracing support.
>> So it is not a good comparison target.
> 
> This patch and the last do the same thing in 2 different ways. One of
> them should match the other even if the helpers are not directly
> reused.

I am not sure whether I understand your question. But if it is about
the different implementations, the difference is due to where the code
is called from.

In the early IO support patch, tdx_early_io() is called from the #VE
handler, so there is extra code to extract the exception information
from struct ve_info.

In this case, the callers are the __in/__out macros, so that extra
code is not needed.

The actual hypercall usage is similar in both cases.

> 
>> In this case, the reason for adding helper function is to make it easier
>> for calling it from tdx.h.
>>>
>>>> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
>>>> index ef7a686a55a9..daa75c8eef5d 100644
>>>> --- a/arch/x86/include/asm/io.h
>>>> +++ b/arch/x86/include/asm/io.h
>>>> @@ -40,6 +40,7 @@
>>>>
>>
>> snip
>>
>>>> +
>>>> +/* Helper function for converting {b,w,l} to byte size */
>>>> +static inline int tdx_get_iosize(char *str)
>>>> +{
>>>> +       if (str[0] == 'w')
>>>> +               return 2;
>>>> +       else if (str[0] == 'l')
>>>> +               return 4;
>>>> +
>>>> +       return 1;
>>>> +}
>>>
>>> This seems like an unnecessary novelty. The BUILDIO() macro in
>>> arch/x86/include/asm/io.h takes a type argument, why can't the size be
>>> explicitly specified rather than inferred from string parsing?
>>
>> I don't want to make changes to generic macros in io.h if it can be
>> avoided. It follows similar argument/type in all arch/* code. Also, it
>> is easier to handle TDX as a special case here.
>>
> 
> What changes are you talking about to the generic macros? The BUILDIO
> macro passes in a size parameter explicitly rather than inferring the
> size from the string name of an argument. BUILDIO does not need to
> change, it's backend just needs to do the right thing in the TDX case.
> 
> Otherwise, "I don't want to" is not a sufficient justification for
> avoiding needlessly new design patterns.

I hope this is what you meant?

--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -273,25 +273,25 @@ static inline bool sev_key_active(void) { return false; }
  #endif /* CONFIG_AMD_MEM_ENCRYPT */

  #ifndef __out
-#define __out(bwl, bw)                                                 \
+#define __out(bwl, bw, sz)                                             \
         asm volatile("out" #bwl " %" #bw "0, %w1" : : "a"(value), "Nd"(port))
  #endif

  #ifndef __in
-#define __in(bwl, bw)                                                  \
+#define __in(bwl, bw, sz)                                              \
         asm volatile("in" #bwl " %w1, %" #bw "0" : "=a"(value) : "Nd"(port))
  #endif

  #define BUILDIO(bwl, bw, type)                                         \
  static inline void out##bwl(unsigned type value, int port)             \
  {                                                                      \
-       __out(bwl, bw);                                                 \
+       __out(bwl, bw, sizeof(type));                                                   \
  }                                                                      \
                                                                         \
  static inline unsigned type in##bwl(int port)                          \
  {                                                                      \
         unsigned type value;                                            \
-       __in(bwl, bw);                                                  \
+       __in(bwl, bw, sizeof(type));                                                    \
         return value;                                                   \
  }

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v1 3/3] x86/tdx: Handle port I/O
  2021-06-07 16:24                   ` Kuppuswamy, Sathyanarayanan
@ 2021-06-07 17:17                     ` Dan Williams
  2021-06-07 21:52                       ` Kuppuswamy, Sathyanarayanan
  2021-06-08 15:40                       ` [RFC v2-fix-v2 0/3] " Kuppuswamy Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dan Williams @ 2021-06-07 17:17 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Mon, Jun 7, 2021 at 9:25 AM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
>
>
> On 6/5/21 2:08 PM, Dan Williams wrote:
> > On Sat, Jun 5, 2021 at 1:08 PM Kuppuswamy, Sathyanarayanan
> > <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> >>
> >>
> >>
> >> On 6/5/21 11:52 AM, Dan Williams wrote:
> >>> On Wed, May 26, 2021 at 9:24 PM Kuppuswamy Sathyanarayanan
> >>> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> >>>>
> >>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >>>>
> >>>> TDX hypervisors cannot emulate instructions directly. This
> >>>> includes port IO which is normally emulated in the hypervisor.
> >>>> All port IO instructions inside TDX trigger the #VE exception
> >>>> in the guest and would be normally emulated there.
> >>>>
> >>>> For the really early code in the decompressor, #VE cannot be
> >>>> used because the IDT needed for handling the exception is not
> >>>> set-up, and some other infrastructure needed by the handler
> >>>> is missing. So to support port IO in decompressor code, add
> >>>> support for paravirt based I/O port virtualization.
> >>>>
> >>>> Also string I/O is not supported in TDX guest. So, unroll the
> >>>> string I/O operation into a loop operating on one element at
> >>>> a time. This method is similar to AMD SEV, so just extend the
> >>>> support for TDX guest platform.
> >>>
> >>> Given early port IO is broken out in its own previous I think it makes
> >>> sense to break out the decompressor port IO enabling from final
> >>> runtime port IO support.
> >>
> >> Patch titled "x86/tdx: Handle early IO operations" mainly adds
> >> IO #VE support in early exception handler. Decompression code IO
> >> support does not have dependency on it. You still think it is
> >> better to move it that patch?
> >>
> >
> > No, I was suggesting three patches instead of 2:
>
> Ok. I will move it to separate patch.
>
> >
>
> snip
>
> >>
> >> Currently decompression code cannot use #VE based IO emulation. It does
> >> not know how to handle #VE exceptions. Also, It is much easier to replace
> >> IO calls with TDX hypercalls in decompression code when compared with
> >> teaching how to handle #VE exceptions in decompression code.
> >
> > Ok, but that does not answer the background behind the decision to use
> > emulation rather than direct replacement of port IO instructions in
> > the final kernel runtime image.
>
> The reason for using #VE based emulation is,
>
> 1. It does not require changes to all usages of emulated instructions in
>     all the drivers.
> 2. Directly replacing instructions with TDX hypercalls will lead to bloated
>     image. So we cannot universally adapt this approach.
>
> Reason for not adapting to use #VE approach for decompression code is,
>
> 1. #VE handler support does not exist for de-compressor code.
> 2. Adding such support is more complex and in-efficient (just for
>     single use case of handling IO instructions).
>
> So we have replaced IO instructions with TDX hypercalls in decompression code.
>
> Did it answer your query?

Yes, all but the concern of printk recursion.

>
> >
> > This patch mixes those 2 concerns and I think it deserves to be broken
> > out and explained.
> >
>
>
>
> >>>> +/*
> >>>> + * Helper function used for making hypercall for "out"
> >>>> + * instruction. It will be called from __out IO
> >>>> + * macro (in tdx.h).
> >>>> + */
> >>>> +void tdg_out(int size, int port, unsigned int value)
> >>>> +{
> >>>> +       __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 1,
> >>>> +                       port, value, NULL);
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * Helper function used for making hypercall for "in"
> >>>> + * instruction. It will be called from __in IO macro
> >>>> + * (in tdx.h). If IO is failed, it will return all 1s.
> >>>> + */
> >>>> +unsigned int tdg_in(int size, int port)
> >>>> +{
> >>>> +       struct tdx_hypercall_output out = {0};
> >>>> +       int err;
> >>>> +
> >>>> +       err = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 0,
> >>>> +                             port, 0, &out);
> >>>> +
> >>>> +       return err ? UINT_MAX : out.r11;
> >>>> +}
> >>>
> >>> The previous patch open coded tdg_{in,out} and this one provides
> >>> helpers. I think at a minimum they should be consistent and pick one
> >>> style.
> >>
> >> As I have mentioned above, early IO #VE handler is a special case. we
> >> don't want to complicate its code path with debug or tracing support.
> >> So it is not a good comparison target.
> >
> > This patch and the last do the same thing in 2 different ways. One of
> > them should match the other even if the helpers are not directly
> > reused.
>
> I am not sure whether I understand your question. But if the question is
> about the different implementations, the difference is due to where each
> is called from.
>
> In early IO support patch, tdx_early_io() is called from #VE handler.
> So there is extra buffer code to extract exception information from
> struct ve_info.
>
> In this case, the callers are the __in/__out macros, so there is no need
> for the above-mentioned buffer code.
>
> Actual hypercall usage is similar in both cases.

It's similar when it could be identical. It simply looks like 2 authors
wrote 2 different code paths, which is the case. Just huddle and code
it one way; I don't much care which.

>
> >
> >> In this case, the reason for adding a helper function is to make it
> >> easier to call from tdx.h.
> >>>
> >>>> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> >>>> index ef7a686a55a9..daa75c8eef5d 100644
> >>>> --- a/arch/x86/include/asm/io.h
> >>>> +++ b/arch/x86/include/asm/io.h
> >>>> @@ -40,6 +40,7 @@
> >>>>
> >>
> >> snip
> >>
> >>>> +
> >>>> +/* Helper function for converting {b,w,l} to byte size */
> >>>> +static inline int tdx_get_iosize(char *str)
> >>>> +{
> >>>> +       if (str[0] == 'w')
> >>>> +               return 2;
> >>>> +       else if (str[0] == 'l')
> >>>> +               return 4;
> >>>> +
> >>>> +       return 1;
> >>>> +}
> >>>
> >>> This seems like an unnecessary novelty. The BUILDIO() macro in
> >>> arch/x86/include/asm/io.h takes a type argument, why can't the size be
> >>> explicitly specified rather than inferred from string parsing?
> >>
> >> I don't want to make changes to the generic macros in io.h if it can be
> >> avoided. They follow a similar argument/type pattern in all arch/* code.
> >> Also, it is easier to handle TDX as a special case here.
> >>
> >
> > What changes are you talking about to the generic macros? The BUILDIO
> > macro passes in a size parameter explicitly rather than inferring the
> > size from the string name of an argument. BUILDIO does not need to
> > change, its backend just needs to do the right thing in the TDX case.
> >
> > Otherwise, "I don't want to" is not a sufficient justification for
> > introducing needless new design patterns.
>
> I hope this is what you meant?
>
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -273,25 +273,25 @@ static inline bool sev_key_active(void) { return false; }
>   #endif /* CONFIG_AMD_MEM_ENCRYPT */
>
>   #ifndef __out
> -#define __out(bwl, bw)                                                 \
> +#define __out(bwl, bw, sz)                                             \
>          asm volatile("out" #bwl " %" #bw "0, %w1" : : "a"(value), "Nd"(port))
>   #endif
>
>   #ifndef __in
> -#define __in(bwl, bw)                                                  \
> +#define __in(bwl, bw, sz)                                              \
>          asm volatile("in" #bwl " %w1, %" #bw "0" : "=a"(value) : "Nd"(port))
>   #endif
>
>   #define BUILDIO(bwl, bw, type)                                         \
>   static inline void out##bwl(unsigned type value, int port)             \
>   {                                                                      \
> -       __out(bwl, bw);                                                 \
> +       __out(bwl, bw, sizeof(type));                                                   \
>   }                                                                      \
>                                                                          \
>   static inline unsigned type in##bwl(int port)                          \
>   {                                                                      \
>          unsigned type value;                                            \
> -       __in(bwl, bw);                                                  \
> +       __in(bwl, bw, sizeof(type));                                                    \
>          return value;                                                   \
>   }

Looks good.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-03 18:14               ` Borislav Petkov
  2021-06-03 18:15                 ` [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn Borislav Petkov
  2021-06-03 18:33                 ` [RFC v2-fix-v2 " Kuppuswamy, Sathyanarayanan
@ 2021-06-07 18:01                 ` Kuppuswamy, Sathyanarayanan
  2021-06-07 18:26                   ` Borislav Petkov
  2 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-07 18:01 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Tom Lendacky



On 6/3/21 11:14 AM, Borislav Petkov wrote:
> On Tue, Jun 01, 2021 at 02:14:17PM -0700, Kuppuswamy Sathyanarayanan wrote:

snip

> diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
> index 6855d5b3e244..bb4b1a06b21f 100644
> --- a/include/linux/protected_guest.h
> +++ b/include/linux/protected_guest.h
> @@ -2,7 +2,9 @@
>   #ifndef _LINUX_PROTECTED_GUEST_H
>   #define _LINUX_PROTECTED_GUEST_H 1
>   
> -#include <linux/mem_encrypt.h>
> +#include <asm/processor.h>
> +#include <asm/tdx.h>
> +#include <asm/sev.h>
>   
>   /* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */
>   
> @@ -20,23 +22,18 @@
>   #define VM_DISABLE_UNCORE_SUPPORT	0x105
>   
>   #if defined(CONFIG_INTEL_TDX_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
> -
> -#include <asm/tdx.h>
> -

Why move this header outside the CONFIG_INTEL_TDX_GUEST or CONFIG_AMD_MEM_ENCRYPT ifdef?

This header only exists in x86 arch code, so it is better to guard this
x86-specific include with the ifdef.

>   static inline bool protected_guest_has(unsigned long flag)
>   {
>   	if (is_tdx_guest())
>   		return tdx_protected_guest_has(flag);
> -	else if (mem_encrypt_active())
> -		return amd_protected_guest_has(flag);
> +	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
> +		return sev_protected_guest_has(flag);
>   
>   	return false;
>   }
>   
>   #else
> -
>   static inline bool protected_guest_has(unsigned long flag) { return false; }
> -
>   #endif
>   
> -#endif
> +#endif /* _LINUX_PROTECTED_GUEST_H */
> 
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-07 18:01                 ` Kuppuswamy, Sathyanarayanan
@ 2021-06-07 18:26                   ` Borislav Petkov
  2021-06-09 14:01                     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-06-07 18:26 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Tom Lendacky

On Mon, Jun 07, 2021 at 11:01:05AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> Why move this header outside CONFIG_INTEL_TDX_GUEST or
> CONFIG_AMD_MEM_ENCRYPT ifdef?

Because asm headers are usually included at the beginning of another,
possibly generic, header, unless you have a particularly special
reason to put them inside additional guarding ifdeffery. Have a look at
include/linux/.

> This header only exists in x86 arch code. So it is better to protect
> it with x86 specific header file.

That doesn't sound like a special reason to me. And compilers are
usually very good at discarding unused symbols, so I don't see a problem
with keeping all includes at the top, as is usually done.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn
  2021-06-03 18:15                 ` [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn Borislav Petkov
  2021-06-04 22:01                   ` Tom Lendacky
@ 2021-06-07 19:55                   ` Kirill A. Shutemov
  2021-06-07 20:14                     ` Borislav Petkov
  1 sibling, 1 reply; 381+ messages in thread
From: Kirill A. Shutemov @ 2021-06-07 19:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Tom Lendacky

On Thu, Jun 03, 2021 at 08:15:46PM +0200, Borislav Petkov wrote:
> From f1e9f051c86b09fe660f49b0307bc7c6cec5e6f4 Mon Sep 17 00:00:00 2001
> From: Borislav Petkov <bp@suse.de>
> Date: Thu, 3 Jun 2021 20:03:31 +0200
> Subject: Convert sme_active()
> 
> diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
> index 9c80c68d75b5..1bb9f22629fc 100644
> --- a/arch/x86/include/asm/mem_encrypt.h
> +++ b/arch/x86/include/asm/mem_encrypt.h
> @@ -50,7 +50,6 @@ void __init mem_encrypt_free_decrypted_mem(void);
>  void __init mem_encrypt_init(void);
>  
>  void __init sev_es_init_vc_handling(void);
> -bool sme_active(void);
>  bool sev_active(void);
>  bool sev_es_active(void);
>  
> @@ -75,7 +74,6 @@ static inline void __init sme_encrypt_kernel(struct boot_params *bp) { }
>  static inline void __init sme_enable(struct boot_params *bp) { }
>  
>  static inline void sev_es_init_vc_handling(void) { }
> -static inline bool sme_active(void) { return false; }
>  static inline bool sev_active(void) { return false; }
>  static inline bool sev_es_active(void) { return false; }
>  
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index c078b0d3ab0e..1d88232146ab 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -387,7 +387,7 @@ void machine_kexec(struct kimage *image)
>  				       (unsigned long)page_list,
>  				       image->start,
>  				       image->preserve_context,
> -				       sme_active());
> +				       protected_guest_has(VM_HOST_MEM_ENCRYPT));
>  
>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)

I think conversions like this are wrong: relocate_kernel(), which got
called here, only knows how to deal with SME, not how to handle some
generic case.

(After a quick check, looks like all conversions in the patch are wrong
for the same reason.)

If code is written to handle a specific technology, we need to stick with
a check that makes that clear. Trying to make it sound generic only leads
to confusion.

Also, we have host memory encryption that doesn't require any of this
code: TME does the encryption transparently to the OS.

Maybe it's better to take a conservative path: keep a check specific until
we find it can serve more than one HW feature?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn
  2021-06-07 19:55                   ` Kirill A. Shutemov
@ 2021-06-07 20:14                     ` Borislav Petkov
  2021-06-07 22:26                       ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-06-07 20:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Tom Lendacky

On Mon, Jun 07, 2021 at 10:55:44PM +0300, Kirill A. Shutemov wrote:
> I think conversions like this are wrong: relocate_kernel(), which got
> called here, only knows how to deal with SME, not how to handle some
> generic case.

What do you mean wrong? Wrong for TDX?

If so, then that can be

protected_guest_has(SME)

or so, which would be false on Intel.

And this patch was only a mechanical conversion to see how it would look
like.

> If code is written to handle a specific technology we need to stick
> with a check that makes it clear. Trying to make sound generic only
> leads to confusion.

Sure, fine by me.

And I don't want a zoo of gazillion small checking functions per
technology. sev_<something>, tdx_<something>, yadda yadda.

So stuff better be unified. Even if you'd have vendor-specific defines
you hand into that function - and you will have such - it is still much
saner than what it turns into with the AMD side of things.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v1 3/3] x86/tdx: Handle port I/O
  2021-06-07 17:17                     ` Dan Williams
@ 2021-06-07 21:52                       ` Kuppuswamy, Sathyanarayanan
  2021-06-07 22:00                         ` Dan Williams
  2021-06-08 15:40                       ` [RFC v2-fix-v2 0/3] " Kuppuswamy Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-07 21:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List



On 6/7/21 10:17 AM, Dan Williams wrote:
>> Did it answer your query?
> Yes, all but the concern of printk recursion.

I think recursion is not possible because printk will
handle it (using console_lock). If another print is
triggered while the current printk is being handled, it
will be directed to the logbuf and delayed.

https://elixir.bootlin.com/linux/latest/source/kernel/printk/printk_safe.c#L382

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v1 3/3] x86/tdx: Handle port I/O
  2021-06-07 21:52                       ` Kuppuswamy, Sathyanarayanan
@ 2021-06-07 22:00                         ` Dan Williams
  2021-06-08  2:57                           ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-06-07 22:00 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Mon, Jun 7, 2021 at 2:52 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
>
>
> On 6/7/21 10:17 AM, Dan Williams wrote:
> >> Did it answer your query?
> > Yes, all but the concern of printk recursion.
>
> I think recursion is not possible because printk will
> handle it (using console_lock). If another print is
> triggered during the current printk handling, it will
> be directed to logbuf and delayed.
>
> https://elixir.bootlin.com/linux/latest/source/kernel/printk/printk_safe.c#L382

That depends on printk_nmi_direct_enter() setting the context; wouldn't
an equivalent printk_ve_direct_enter() context flag be needed as well?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn
  2021-06-07 20:14                     ` Borislav Petkov
@ 2021-06-07 22:26                       ` Kuppuswamy, Sathyanarayanan
  2021-06-08 21:30                         ` [RFC v2-fix-v3 1/1] x86: Introduce generic protected guest abstraction Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-07 22:26 UTC (permalink / raw)
  To: Borislav Petkov, Kirill A. Shutemov
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Tom Lendacky



On 6/7/21 1:14 PM, Borislav Petkov wrote:
> On Mon, Jun 07, 2021 at 10:55:44PM +0300, Kirill A. Shutemov wrote:
>> I think conversions like this are wrong: relocate_kernel(), which got
>> called here, only knows how to deal with SME, not how to handle some
>> generic case.
> 
> What do you mean wrong? Wrong for TDX?
> 
> If so, then that can be
> 
> protected_guest_has(SME)
> 
> or so, which would be false on Intel.

I agree. Since most of the code changed in this patch is
not applicable to TDX, it might need product-specific or
new function-specific flags.

> 
> And this patch was only a mechanical conversion to see how it would look
> like.
> 
>> If code is written to handle a specific technology we need to stick
>> with a check that makes it clear. Trying to make sound generic only
>> leads to confusion.
> 
> Sure, fine by me.
> 
> And I don't want a zoo of gazillion small checking functions per
> technology. sev_<something>, tdx_<something>, yadda yadda.
> 
> So stuff better be unified. Even if you'd have vendor-specific defines
> you hand into that function - and you will have such - it is still much
> saner than what it turns into with the AMD side of things.

Agree. Currently we share code with AMD SEV in the memory encryption
support and string I/O handling code, so defining a common flag for such
code is useful.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v1 3/3] x86/tdx: Handle port I/O
  2021-06-07 22:00                         ` Dan Williams
@ 2021-06-08  2:57                           ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-06-08  2:57 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List


> > https://elixir.bootlin.com/linux/latest/source/kernel/printk/printk_safe.c#L382

> That depends on printk_nmi_direct_enter() to set the context, wouldn't
> an equivalent printk_ve_direct_enter() context flag be needed as well?

Even without it the console semaphore is always trylocked. So recursion 
is just not possible.

What would be possible is an endless loop (printk adding more information 
to the log buffer, which is then printed, etc.), but that's true 
everywhere in the console/serial driver subsystem.


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 0/3] x86/tdx: Handle port I/O
  2021-06-07 17:17                     ` Dan Williams
  2021-06-07 21:52                       ` Kuppuswamy, Sathyanarayanan
@ 2021-06-08 15:40                       ` Kuppuswamy Sathyanarayanan
  2021-06-08 15:40                         ` [RFC v2-fix-v2 1/3] x86/tdx: Handle port I/O in decompression code Kuppuswamy Sathyanarayanan
                                           ` (2 more replies)
  1 sibling, 3 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-08 15:40 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

This patchset addresses the review comments in the patch titled
"[RFC v2 14/32] x86/tdx: Handle port I/O". Since it requires
patch split, sending these together.

Changes since RFC v2-fix-v1:
 * Split TDX decompression IO support into a separate patch.
 * Implemented tdg_handle_io() and tdx_early_io() in a similar
   way, as per review suggestion.
 * Added the VE_IS_IO_OUT() macro as per review suggestion.
 * Added VE_IS_IO_STRING() to check the string I/O case in
   tdx_early_io().
 * Removed the helper functions tdg_in() and tdg_out() and directly
   called the IO hypercall to make the implementation uniform in the
   decompression code, early IO code and normal IO handler code.

Changes since RFC v2:
 * Removed assembly implementation of port IO emulation code
   and modified __in/__out IO helpers to directly call C function
   for in/out instruction emulation in decompression code.
 * Added the helper function tdx_get_iosize() to make it easier to
   call the tdg_out()/tdg_in() C functions from decompression code.
 * Added support for early exception handler to support IO
   instruction emulation in early boot kernel code.
 * Removed alternative_ usage and made kernel only use #VE based
   IO instruction emulation support outside the decompression module.
 * Added support for the protected_guest_has() API to generalize
   AMD SEV/TDX specific initialization code in common drivers.
 * Fixed commit log and comments as per review comments.


Andi Kleen (1):
  x86/tdx: Handle early IO operations

Kirill A. Shutemov (1):
  x86/tdx: Handle port I/O

Kuppuswamy Sathyanarayanan (1):
  x86/tdx: Handle port I/O in decompression code

 arch/x86/boot/compressed/Makefile |  1 +
 arch/x86/boot/compressed/tdcall.S |  3 ++
 arch/x86/include/asm/io.h         | 15 +++---
 arch/x86/include/asm/tdx.h        | 54 ++++++++++++++++++++
 arch/x86/kernel/head64.c          |  3 ++
 arch/x86/kernel/tdx.c             | 84 +++++++++++++++++++++++++++++++
 6 files changed, 154 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdcall.S

-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 1/3] x86/tdx: Handle port I/O in decompression code
  2021-06-08 15:40                       ` [RFC v2-fix-v2 0/3] " Kuppuswamy Sathyanarayanan
@ 2021-06-08 15:40                         ` Kuppuswamy Sathyanarayanan
  2021-06-08 23:12                           ` Dan Williams
  2021-06-08 15:40                         ` [RFC v2-fix-v2 2/3] x86/tdx: Handle early IO operations Kuppuswamy Sathyanarayanan
  2021-06-08 15:40                         ` [RFC v2-fix-v2 3/3] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
  2 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-08 15:40 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Add support to replace in/out instructions in the
decompression code with TDX IO hypercalls.

TDX cannot do port IO directly. The TDX module triggers
a #VE exception to let the guest kernel emulate port
I/O by converting it into TDX hypercalls to call the
host.

But for the really early code in the decompressor, #VE
cannot be used because the IDT needed for handling the
exception is not set up, and some other infrastructure
needed by the handler is missing. So to support port IO
in the decompressor code, directly replace in/out
instructions with TDX IO hypercalls. This can be easily
achieved by modifying the __in/__out macros.

Also, since the TDX IO hypercall requires an IO size parameter,
modify the __in/__out macros to accept the size as an input parameter.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile |  1 +
 arch/x86/boot/compressed/tdcall.S |  3 ++
 arch/x86/include/asm/io.h         |  9 +++---
 arch/x86/include/asm/tdx.h        | 48 +++++++++++++++++++++++++++++++
 4 files changed, 57 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdcall.S

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 22a2a6cc2ab4..1bfe30ebadbe 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -99,6 +99,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
new file mode 100644
index 000000000000..aafadc136c88
--- /dev/null
+++ b/arch/x86/boot/compressed/tdcall.S
@@ -0,0 +1,3 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include "../../kernel/tdcall.S"
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index be96bf1e667a..391205dace98 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -40,6 +40,7 @@
 
 #include <linux/string.h>
 #include <linux/compiler.h>
+#include <linux/protected_guest.h>
 #include <asm/page.h>
 #include <asm/early_ioremap.h>
 #include <asm/pgtable_types.h>
@@ -272,25 +273,25 @@ static inline bool sev_key_active(void) { return false; }
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 
 #ifndef __out
-#define __out(bwl, bw)							\
+#define __out(bwl, bw, sz)						\
 	asm volatile("out" #bwl " %" #bw "0, %w1" : : "a"(value), "Nd"(port))
 #endif
 
 #ifndef __in
-#define __in(bwl, bw)							\
+#define __in(bwl, bw, sz)						\
 	asm volatile("in" #bwl " %w1, %" #bw "0" : "=a"(value) : "Nd"(port))
 #endif
 
 #define BUILDIO(bwl, bw, type)						\
 static inline void out##bwl(unsigned type value, int port)		\
 {									\
-	__out(bwl, bw);							\
+	__out(bwl, bw, sizeof(type));					\
 }									\
 									\
 static inline unsigned type in##bwl(int port)				\
 {									\
 	unsigned type value;						\
-	__in(bwl, bw);							\
+	__in(bwl, bw, sizeof(type));					\
 	return value;							\
 }									\
 									\
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index cbfe7479f2a3..ac38ed5b50db 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -9,6 +9,8 @@
 
 #include <asm/cpufeature.h>
 #include <linux/types.h>
+#include <vdso/limits.h>
+#include <asm/vmx.h>
 
 /*
  * Used in __tdx_module_call() helper function to gather the
@@ -73,6 +75,52 @@ u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
 
 bool tdx_protected_guest_has(unsigned long flag);
 
+/*
+ * To support I/O port access in decompressor or early kernel init
+ * code, since #VE exception handler cannot be used, use paravirt
+ * model to implement __in/__out macros which will in turn be used
+ * by in{b,w,l}()/out{b,w,l} I/O helper macros used in kernel. You
+ * can find the __in/__out macro usage in arch/x86/include/asm/io.h
+ */
+#ifdef BOOT_COMPRESSED_MISC_H
+
+/*
+ * Helper function used for making hypercall for "in"
+ * instruction. It will be called from __in IO macro
+ * If IO is failed, it will return all 1s.
+ */
+static inline unsigned int tdg_in(int size, int port)
+{
+	struct tdx_hypercall_output out = {0};
+	int err;
+
+	err = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, 0,
+			      port, 0, &out);
+
+	return err ? UINT_MAX : out.r11;
+}
+
+#define __out(bwl, bw, sz)						\
+do {									\
+	if (is_tdx_guest()) {						\
+		__tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, sz, 1,	\
+				port, value, NULL);			\
+	} else {							\
+		asm volatile("out" #bwl " %" #bw "0, %w1" : :		\
+				"a"(value), "Nd"(port));		\
+	}								\
+} while (0)
+#define __in(bwl, bw, sz)						\
+do {									\
+	if (is_tdx_guest()) {						\
+		value = tdg_in(sz, port);				\
+	} else {							\
+		asm volatile("in" #bwl " %w1, %" #bw "0" :		\
+				"=a"(value) : "Nd"(port));		\
+	}								\
+} while (0)
+#endif
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 2/3] x86/tdx: Handle early IO operations
  2021-06-08 15:40                       ` [RFC v2-fix-v2 0/3] " Kuppuswamy Sathyanarayanan
  2021-06-08 15:40                         ` [RFC v2-fix-v2 1/3] x86/tdx: Handle port I/O in decompression code Kuppuswamy Sathyanarayanan
@ 2021-06-08 15:40                         ` Kuppuswamy Sathyanarayanan
  2021-06-08 15:40                         ` [RFC v2-fix-v2 3/3] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
  2 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-08 15:40 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: Andi Kleen <ak@linux.intel.com>

Add an early #VE handler to convert early port IOs into
TDCALLs.

TDX cannot do port IO directly. The TDX module triggers
a #VE exception to let the guest kernel emulate port
I/O by converting it into TDCALLs to call the host.

To support port IO in early boot code, add minimal
support in the early exception handlers. This is similar
to what AMD SEV does. This is mainly to support
early_printk's serial driver, as well as potentially the
VGA driver (although it is expected not to be used).

The early handler only does IO calls and nothing else, and
anything that goes wrong results in a normal early exception
panic.

It cannot share the code paths with the normal #VE handler
because it needs to avoid using trace calls or printk.

This early handler allows us to use the normal in*/out*
macros without patching them for every driver. Since
there is no expectation that early port IO is
performance-critical, the #VE emulation cost is worth the
simplicity benefit of not patching out port IO usage in
early code. There are also no concerns with nesting,
since there should be no NMIs this early.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix-v1:
 * Fixed commit log as per review comments.
 * Removed IS_ENABLED(CONFIG_INTEL_TDX_GUEST) check for tdg_early_handle_ve()
 * Changed VE_GET_IO_TYPE to VE_IS_IO_OUT() and modified tdx_early_io()
   as per review suggestion.
 * Added support to check string I/O case in tdx_early_io().
 * Modified tdx_early_io() to pass exit_qual instead of ve_info.

 arch/x86/include/asm/tdx.h |  6 ++++
 arch/x86/kernel/head64.c   |  3 ++
 arch/x86/kernel/tdx.c      | 56 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 65 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index ac38ed5b50db..fab5eebf4023 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -74,6 +74,7 @@ u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
 		    struct tdx_hypercall_output *out);
 
 bool tdx_protected_guest_has(unsigned long flag);
+bool tdg_early_handle_ve(struct pt_regs *regs);
 
 /*
  * To support I/O port access in decompressor or early kernel init
@@ -135,6 +136,11 @@ static inline bool tdx_protected_guest_has(unsigned long flag)
 	return false;
 }
 
+static inline bool tdg_early_handle_ve(struct pt_regs *regs)
+{
+	return false;
+}
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index d1a4942ae160..323ce7f156f5 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -410,6 +410,9 @@ void __init do_early_exception(struct pt_regs *regs, int trapnr)
 	    trapnr == X86_TRAP_VC && handle_vc_boot_ghcb(regs))
 		return;
 
+	if (trapnr == X86_TRAP_VE && tdg_early_handle_ve(regs))
+		return;
+
 	early_fixup_exception(regs, trapnr);
 }
 
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index b1cdb37a8636..3410cfc8a988 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -13,6 +13,11 @@
 #define TDINFO				1
 #define TDGETVEINFO			3
 
+#define VE_IS_IO_OUT(exit_qual)		(((exit_qual) & 8) ? 0 : 1)
+#define VE_GET_IO_SIZE(exit_qual)	(((exit_qual) & 7) + 1)
+#define VE_GET_PORT_NUM(exit_qual)	((exit_qual) >> 16)
+#define VE_IS_IO_STRING(exit_qual)	((exit_qual) & 16 ? 1 : 0)
+
 static struct {
 	unsigned int gpa_width;
 	unsigned long attributes;
@@ -254,6 +259,57 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	return ret;
 }
 
+/*
+ * Handle early I/O, mainly for the early printk serial output.
+ * This avoids anything that doesn't work early on, like tracing
+ * or printk, by calling the low-level functions directly. Any
+ * problems are handled by falling back to a standard early exception.
+ *
+ * Assumes the IO instruction was using ax, which is enforced
+ * by the standard io.h macros.
+ */
+static __init bool tdx_early_io(struct pt_regs *regs, u32 exit_qual)
+{
+	struct tdx_hypercall_output outh;
+	int out = VE_IS_IO_OUT(exit_qual);
+	int size = VE_GET_IO_SIZE(exit_qual);
+	int port = VE_GET_PORT_NUM(exit_qual);
+	u64 mask = GENMASK(8 * size, 0);
+	bool string = VE_IS_IO_STRING(exit_qual);
+	int ret;
+
+	/* I/O strings ops are unrolled at build time. */
+	if (string)
+		return 0;
+
+	ret = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, out, port,
+			      regs->ax, &outh);
+	if (!out && !ret) {
+		regs->ax &= ~mask;
+		regs->ax |= outh.r11 & mask;
+	}
+
+	return !ret;
+}
+
+/*
+ * Early #VE exception handler. Only used to handle port I/O
+ * for early_printk. If anything goes wrong, handle it like
+ * a normal early exception.
+ */
+__init bool tdg_early_handle_ve(struct pt_regs *regs)
+{
+	struct ve_info ve;
+
+	if (tdg_get_ve_info(&ve))
+		return false;
+
+	if (ve.exit_reason == EXIT_REASON_IO_INSTRUCTION)
+		return tdx_early_io(regs, ve.exit_qual);
+
+	return false;
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 3/3] x86/tdx: Handle port I/O
  2021-06-08 15:40                       ` [RFC v2-fix-v2 0/3] " Kuppuswamy Sathyanarayanan
  2021-06-08 15:40                         ` [RFC v2-fix-v2 1/3] x86/tdx: Handle port I/O in decompression code Kuppuswamy Sathyanarayanan
  2021-06-08 15:40                         ` [RFC v2-fix-v2 2/3] x86/tdx: Handle early IO operations Kuppuswamy Sathyanarayanan
@ 2021-06-08 15:40                         ` Kuppuswamy Sathyanarayanan
  2021-06-08 16:26                           ` Dan Williams
  2 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-08 15:40 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

TDX hypervisors cannot emulate instructions directly. This
includes port I/O, which is normally emulated in the hypervisor.
All port I/O instructions inside a TDX guest trigger the #VE
exception in the guest and would normally be emulated there.

String I/O is also not supported in a TDX guest, so unroll the
string I/O operations into a loop operating on one element at
a time. This method is similar to AMD SEV, so just extend that
support to the TDX guest platform.
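
The unrolling described above can be sketched in plain C (a userspace
model for illustration, not the kernel code; outb_once() is a
hypothetical stand-in that records each one-element OUT so the loop
can be observed):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the single-element outb(): records each
 * emulated OUT so the unrolling can be observed from a test. */
static uint8_t out_log[16];
static size_t out_count;

static void outb_once(uint8_t value, uint16_t port)
{
	(void)port;
	out_log[out_count++] = value;
}

/* Unrolled equivalent of a `rep outsb` string operation: one
 * trappable single-element OUT per iteration, mirroring the loop
 * the patch adds to outs##bwl() for protected guests. */
static void outsb_unrolled(uint16_t port, const void *addr, size_t count)
{
	const uint8_t *value = addr;

	while (count) {
		outb_once(*value, port);
		value++;
		count--;
	}
}
```

Each iteration triggers its own #VE, which the guest converts into a
single hypercall, instead of one unemulatable string instruction.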

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
Changes since RFC v2-fix-v1:
 * Fixed commit log to adapt to decompression support code split.

 arch/x86/include/asm/io.h |  6 ++++--
 arch/x86/kernel/tdx.c     | 28 ++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 391205dace98..e01d8bf2b37a 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -310,7 +310,8 @@ static inline unsigned type in##bwl##_p(int port)			\
 									\
 static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 {									\
-	if (sev_key_active()) {						\
+	if (sev_key_active() ||						\
+	    protected_guest_has(VM_UNROLL_STRING_IO)) {			\
 		unsigned type *value = (unsigned type *)addr;		\
 		while (count) {						\
 			out##bwl(*value, port);				\
@@ -326,7 +327,8 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 									\
 static inline void ins##bwl(int port, void *addr, unsigned long count)	\
 {									\
-	if (sev_key_active()) {						\
+	if (sev_key_active() ||						\
+	    protected_guest_has(VM_UNROLL_STRING_IO)) {			\
 		unsigned type *value = (unsigned type *)addr;		\
 		while (count) {						\
 			*value = in##bwl(port);				\
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 3410cfc8a988..48a0cc2663ea 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -201,6 +201,31 @@ static void tdg_handle_cpuid(struct pt_regs *regs)
 	regs->dx = out.r15;
 }
 
+/*
+ * Since the way we fail for the string case is different, we
+ * cannot reuse tdx_early_io().
+ */
+static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
+{
+	struct tdx_hypercall_output outh;
+	int out = VE_IS_IO_OUT(exit_qual);
+	int size = VE_GET_IO_SIZE(exit_qual);
+	int port = VE_GET_PORT_NUM(exit_qual);
+	u64 mask = GENMASK(8 * size, 0);
+	bool string = VE_IS_IO_STRING(exit_qual);
+	int ret;
+
+	/* I/O strings ops are unrolled at build time. */
+	BUG_ON(string);
+
+	ret = __tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, out, port,
+			      regs->ax, &outh);
+	if (!out) {
+		regs->ax &= ~mask;
+		regs->ax |= (ret ? UINT_MAX : outh.r11) & mask;
+	}
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -247,6 +272,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_CPUID:
 		tdg_handle_cpuid(regs);
 		break;
+	case EXIT_REASON_IO_INSTRUCTION:
+		tdg_handle_io(regs, ve->exit_qual);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v3 0/4] x86/tdx: Handle in-kernel MMIO
  2021-06-05 21:56         ` Dan Williams
@ 2021-06-08 15:59           ` Kuppuswamy Sathyanarayanan
  2021-06-08 15:59             ` [RFC v2-fix-v3 1/4] x86/insn-eval: Introduce insn_get_modrm_reg_ptr() Kuppuswamy Sathyanarayanan
                               ` (3 more replies)
  0 siblings, 4 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-08 15:59 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

This patchset addresses the review comments on the patch titled
"x86/tdx: Handle in-kernel MMIO". Since addressing them requires
splitting the patch, the changes are sent together as a set.

Changes since RFC-v2-fix-v2:
 * Further split the patch "x86/sev-es: Abstract out MMIO instruction
   decoding", and moved generic APIs common to both TDX and SEV into
   separate patches. Also introduced a new patch for the AMD code to
   utilize the generic APIs.
 * Fixed commit log as per review comments.

Changes since RFC v2-fix:
 * Introduced "x86/sev-es: Abstract out MMIO instruction
   decoding" patch for sharing common code between TDX
   and SEV.
 * Modified TDX MMIO code to utilize common shared functions.
 * Modified commit log to reflect latest changes and to
   address review comments.

Changes since RFC v2:
 * Fixed commit log as per Dave's review.

Kirill A. Shutemov (4):
  x86/insn-eval: Introduce insn_get_modrm_reg_ptr()
  x86/insn-eval: Introduce insn_decode_mmio()
  x86/sev-es: Use insn_decode_mmio() for MMIO implementation
  x86/tdx: Handle in-kernel MMIO

 arch/x86/include/asm/insn-eval.h |  13 +++
 arch/x86/kernel/sev.c            | 171 ++++++++-----------------------
 arch/x86/kernel/tdx.c            | 109 ++++++++++++++++++++
 arch/x86/lib/insn-eval.c         | 102 ++++++++++++++++++
 4 files changed, 264 insertions(+), 131 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v3 1/4] x86/insn-eval: Introduce insn_get_modrm_reg_ptr()
  2021-06-08 15:59           ` [RFC v2-fix-v3 0/4] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
@ 2021-06-08 15:59             ` Kuppuswamy Sathyanarayanan
  2021-06-08 15:59             ` [RFC v2-fix-v3 2/4] x86/insn-eval: Introduce insn_decode_mmio() Kuppuswamy Sathyanarayanan
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-08 15:59 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The helper returns a pointer to the register indicated by the
ModRM byte.

It is going to replace vc_insn_get_reg() in the SEV MMIO
implementation. The TDX MMIO implementation will also use it.
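
The pointer arithmetic the helper performs can be modeled in a small
userspace sketch (struct toy_regs and toy_reg_ptr() below are
illustrative stand-ins for struct pt_regs and the kernel helper, not
kernel API):

```c
#include <stddef.h>

/* Toy register file standing in for struct pt_regs. */
struct toy_regs {
	long ax, cx, dx, bx;
};

/* Like insn_get_modrm_reg_ptr(): convert a byte offset into the
 * register file into a typed pointer, or NULL if the offset lookup
 * failed (signalled by a negative offset). */
static long *toy_reg_ptr(struct toy_regs *regs, int offset)
{
	if (offset < 0)
		return NULL;
	return (long *)((char *)regs + offset);
}
```

Returning a pointer lets callers read or write the register slot
directly with memcpy()/memset(), which is exactly what the MMIO
emulation paths need.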

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/insn-eval.h |  1 +
 arch/x86/lib/insn-eval.c         | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index 91d7182ad2d6..041f399153b9 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -19,6 +19,7 @@ bool insn_has_rep_prefix(struct insn *insn);
 void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_reg_off(struct insn *insn, struct pt_regs *regs);
+void *insn_get_modrm_reg_ptr(struct insn *insn, struct pt_regs *regs);
 unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx);
 int insn_get_code_seg_params(struct pt_regs *regs);
 int insn_fetch_from_user(struct pt_regs *regs,
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index a67afd74232c..9069d0703881 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -850,6 +850,26 @@ int insn_get_modrm_reg_off(struct insn *insn, struct pt_regs *regs)
 	return get_reg_offset(insn, regs, REG_TYPE_REG);
 }
 
+/**
+ * insn_get_modrm_reg_ptr() - Obtain register pointer based on ModRM byte
+ * @insn:	Instruction containing the ModRM byte
+ * @regs:	Register values as seen when entering kernel mode
+ *
+ * Returns:
+ *
+ * The register indicated by the reg part of the ModRM byte.
+ * The register is obtained as a pointer within pt_regs.
+ */
+void *insn_get_modrm_reg_ptr(struct insn *insn, struct pt_regs *regs)
+{
+	int offset;
+
+	offset = insn_get_modrm_reg_off(insn, regs);
+	if (offset < 0)
+		return NULL;
+	return (void *)regs + offset;
+}
+
 /**
  * get_seg_base_limit() - obtain base address and limit of a segment
  * @insn:	Instruction. Must be valid.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v3 2/4] x86/insn-eval: Introduce insn_decode_mmio()
  2021-06-08 15:59           ` [RFC v2-fix-v3 0/4] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
  2021-06-08 15:59             ` [RFC v2-fix-v3 1/4] x86/insn-eval: Introduce insn_get_modrm_reg_ptr() Kuppuswamy Sathyanarayanan
@ 2021-06-08 15:59             ` Kuppuswamy Sathyanarayanan
  2021-06-08 15:59             ` [RFC v2-fix-v3 3/4] x86/sev-es: Use insn_decode_mmio() for MMIO implementation Kuppuswamy Sathyanarayanan
  2021-06-08 15:59             ` [RFC v2-fix-v3 4/4] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
  3 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-08 15:59 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

In preparation for sharing MMIO instruction decode between
SEV-ES and TDX, factor out the common decode into a new
insn_decode_mmio() helper.

For a regular virtual machine, MMIO is handled by the VMM: KVM
emulates the instructions that caused the MMIO access. But this
model doesn't work for secure VMs (like SEV or TDX) because the
VMM doesn't have access to the guest memory and register state.
So, for TDX or SEV, the VMM needs assistance in handling MMIO:
it induces an exception in the guest, and the guest has to decode
the instruction and handle it on its own.

The code is based on the current SEV MMIO implementation.
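
The shape of the decode can be illustrated with a simplified,
self-contained version that covers only the one-byte MOV opcodes (a
sketch of the approach; the real helper also handles immediates, MOVS
and the two-byte 0x0f opcodes):

```c
enum mmio_type {
	MMIO_DECODE_FAILED,
	MMIO_WRITE,
	MMIO_READ,
};

/* Classify a one-byte MOV opcode as an MMIO read or write and report
 * the memory operand size, mirroring the switch in insn_decode_mmio().
 * opnd_bytes is the operand size determined by prefix decoding
 * (2, 4 or 8). */
static enum mmio_type decode_mov_mmio(unsigned char opcode,
				      int opnd_bytes, int *bytes)
{
	*bytes = 0;

	switch (opcode) {
	case 0x88: /* MOV m8, r8 */
		*bytes = 1;
		/* fall through */
	case 0x89: /* MOV m16/m32/m64, r16/r32/r64 */
		if (!*bytes)
			*bytes = opnd_bytes;
		return MMIO_WRITE;
	case 0x8a: /* MOV r8, m8 */
		*bytes = 1;
		/* fall through */
	case 0x8b: /* MOV r16/r32/r64, m16/m32/m64 */
		if (!*bytes)
			*bytes = opnd_bytes;
		return MMIO_READ;
	}
	return MMIO_DECODE_FAILED;
}
```

The fall-through pattern lets the byte-sized opcode set the size
explicitly while the word-sized sibling inherits the prefix-decoded
operand size.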

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/insn-eval.h | 12 +++++
 arch/x86/lib/insn-eval.c         | 82 ++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+)

diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index 041f399153b9..4a4ca7e7be66 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -29,4 +29,16 @@ int insn_fetch_from_user_inatomic(struct pt_regs *regs,
 bool insn_decode_from_regs(struct insn *insn, struct pt_regs *regs,
 			   unsigned char buf[MAX_INSN_SIZE], int buf_size);
 
+enum mmio_type {
+	MMIO_DECODE_FAILED,
+	MMIO_WRITE,
+	MMIO_WRITE_IMM,
+	MMIO_READ,
+	MMIO_READ_ZERO_EXTEND,
+	MMIO_READ_SIGN_EXTEND,
+	MMIO_MOVS,
+};
+
+enum mmio_type insn_decode_mmio(struct insn *insn, int *bytes);
+
 #endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 9069d0703881..ae15a74040a5 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -1559,3 +1559,85 @@ bool insn_decode_from_regs(struct insn *insn, struct pt_regs *regs,
 
 	return true;
 }
+
+/**
+ * insn_decode_mmio() - Decode a MMIO instruction
+ * @insn:	Structure to store decoded instruction
+ * @bytes:	Returns size of memory operand
+ *
+ * Decodes an instruction used for memory-mapped I/O.
+ *
+ * Returns:
+ *
+ * Type of the instruction. Size of the memory operand is stored in
+ * @bytes. If decoding failed, MMIO_DECODE_FAILED is returned.
+ */
+enum mmio_type insn_decode_mmio(struct insn *insn, int *bytes)
+{
+	int type = MMIO_DECODE_FAILED;
+
+	*bytes = 0;
+
+	insn_get_opcode(insn);
+	switch (insn->opcode.bytes[0]) {
+	case 0x88: /* MOV m8,r8 */
+		*bytes = 1;
+		fallthrough;
+	case 0x89: /* MOV m16/m32/m64, r16/r32/r64 */
+		if (!*bytes)
+			*bytes = insn->opnd_bytes;
+		type = MMIO_WRITE;
+		break;
+
+	case 0xc6: /* MOV m8, imm8 */
+		*bytes = 1;
+		fallthrough;
+	case 0xc7: /* MOV m16/m32/m64, imm16/imm32/imm64 */
+		if (!*bytes)
+			*bytes = insn->opnd_bytes;
+		type = MMIO_WRITE_IMM;
+		break;
+
+	case 0x8a: /* MOV r8, m8 */
+		*bytes = 1;
+		fallthrough;
+	case 0x8b: /* MOV r16/r32/r64, m16/m32/m64 */
+		if (!*bytes)
+			*bytes = insn->opnd_bytes;
+		type = MMIO_READ;
+		break;
+
+	case 0xa4: /* MOVS m8, m8 */
+		*bytes = 1;
+		fallthrough;
+	case 0xa5: /* MOVS m16/m32/m64, m16/m32/m64 */
+		if (!*bytes)
+			*bytes = insn->opnd_bytes;
+		type = MMIO_MOVS;
+		break;
+
+	case 0x0f: /* Two-byte instruction */
+		switch (insn->opcode.bytes[1]) {
+		case 0xb6: /* MOVZX r16/r32/r64, m8 */
+			*bytes = 1;
+			fallthrough;
+		case 0xb7: /* MOVZX r32/r64, m16 */
+			if (!*bytes)
+				*bytes = 2;
+			type = MMIO_READ_ZERO_EXTEND;
+			break;
+
+		case 0xbe: /* MOVSX r16/r32/r64, m8 */
+			*bytes = 1;
+			fallthrough;
+		case 0xbf: /* MOVSX r32/r64, m16 */
+			if (!*bytes)
+				*bytes = 2;
+			type = MMIO_READ_SIGN_EXTEND;
+			break;
+		}
+		break;
+	}
+
+	return type;
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v3 3/4] x86/sev-es: Use insn_decode_mmio() for MMIO implementation
  2021-06-08 15:59           ` [RFC v2-fix-v3 0/4] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
  2021-06-08 15:59             ` [RFC v2-fix-v3 1/4] x86/insn-eval: Introduce insn_get_modrm_reg_ptr() Kuppuswamy Sathyanarayanan
  2021-06-08 15:59             ` [RFC v2-fix-v3 2/4] x86/insn-eval: Introduce insn_decode_mmio() Kuppuswamy Sathyanarayanan
@ 2021-06-08 15:59             ` Kuppuswamy Sathyanarayanan
  2021-06-08 15:59             ` [RFC v2-fix-v3 4/4] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
  3 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-08 15:59 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Tom Lendacky,
	Joerg Roedel, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Switch SEV implementation to insn_decode_mmio(). The helper is going
to be used by TDX too.

No functional changes. It is only build-tested.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/sev.c | 171 ++++++++++--------------------------------
 1 file changed, 40 insertions(+), 131 deletions(-)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index c6b0ee3e2345..cdb5c453e21c 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -790,22 +790,6 @@ static void __init vc_early_forward_exception(struct es_em_ctxt *ctxt)
 	do_early_exception(ctxt->regs, trapnr);
 }
 
-static long *vc_insn_get_reg(struct es_em_ctxt *ctxt)
-{
-	long *reg_array;
-	int offset;
-
-	reg_array = (long *)ctxt->regs;
-	offset    = insn_get_modrm_reg_off(&ctxt->insn, ctxt->regs);
-
-	if (offset < 0)
-		return NULL;
-
-	offset /= sizeof(long);
-
-	return reg_array + offset;
-}
-
 static long *vc_insn_get_rm(struct es_em_ctxt *ctxt)
 {
 	long *reg_array;
@@ -853,76 +837,6 @@ static enum es_result vc_do_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
 	return sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, exit_info_1, exit_info_2);
 }
 
-static enum es_result vc_handle_mmio_twobyte_ops(struct ghcb *ghcb,
-						 struct es_em_ctxt *ctxt)
-{
-	struct insn *insn = &ctxt->insn;
-	unsigned int bytes = 0;
-	enum es_result ret;
-	int sign_byte;
-	long *reg_data;
-
-	switch (insn->opcode.bytes[1]) {
-		/* MMIO Read w/ zero-extension */
-	case 0xb6:
-		bytes = 1;
-		fallthrough;
-	case 0xb7:
-		if (!bytes)
-			bytes = 2;
-
-		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
-		if (ret)
-			break;
-
-		/* Zero extend based on operand size */
-		reg_data = vc_insn_get_reg(ctxt);
-		if (!reg_data)
-			return ES_DECODE_FAILED;
-
-		memset(reg_data, 0, insn->opnd_bytes);
-
-		memcpy(reg_data, ghcb->shared_buffer, bytes);
-		break;
-
-		/* MMIO Read w/ sign-extension */
-	case 0xbe:
-		bytes = 1;
-		fallthrough;
-	case 0xbf:
-		if (!bytes)
-			bytes = 2;
-
-		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
-		if (ret)
-			break;
-
-		/* Sign extend based on operand size */
-		reg_data = vc_insn_get_reg(ctxt);
-		if (!reg_data)
-			return ES_DECODE_FAILED;
-
-		if (bytes == 1) {
-			u8 *val = (u8 *)ghcb->shared_buffer;
-
-			sign_byte = (*val & 0x80) ? 0xff : 0x00;
-		} else {
-			u16 *val = (u16 *)ghcb->shared_buffer;
-
-			sign_byte = (*val & 0x8000) ? 0xff : 0x00;
-		}
-		memset(reg_data, sign_byte, insn->opnd_bytes);
-
-		memcpy(reg_data, ghcb->shared_buffer, bytes);
-		break;
-
-	default:
-		ret = ES_UNSUPPORTED;
-	}
-
-	return ret;
-}
-
 /*
  * The MOVS instruction has two memory operands, which raises the
  * problem that it is not known whether the access to the source or the
@@ -990,83 +904,78 @@ static enum es_result vc_handle_mmio_movs(struct es_em_ctxt *ctxt,
 		return ES_RETRY;
 }
 
-static enum es_result vc_handle_mmio(struct ghcb *ghcb,
-				     struct es_em_ctxt *ctxt)
+static enum es_result vc_handle_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
 {
 	struct insn *insn = &ctxt->insn;
 	unsigned int bytes = 0;
+	enum mmio_type mmio;
 	enum es_result ret;
+	u8 sign_byte;
 	long *reg_data;
 
-	switch (insn->opcode.bytes[0]) {
-	/* MMIO Write */
-	case 0x88:
-		bytes = 1;
-		fallthrough;
-	case 0x89:
-		if (!bytes)
-			bytes = insn->opnd_bytes;
+	mmio = insn_decode_mmio(insn, &bytes);
+	if (mmio == MMIO_DECODE_FAILED)
+		return ES_DECODE_FAILED;
 
-		reg_data = vc_insn_get_reg(ctxt);
+	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+		reg_data = insn_get_modrm_reg_ptr(insn, ctxt->regs);
 		if (!reg_data)
 			return ES_DECODE_FAILED;
+	}
 
+	switch (mmio) {
+	case MMIO_WRITE:
 		memcpy(ghcb->shared_buffer, reg_data, bytes);
-
 		ret = vc_do_mmio(ghcb, ctxt, bytes, false);
 		break;
-
-	case 0xc6:
-		bytes = 1;
-		fallthrough;
-	case 0xc7:
-		if (!bytes)
-			bytes = insn->opnd_bytes;
-
+	case MMIO_WRITE_IMM:
 		memcpy(ghcb->shared_buffer, insn->immediate1.bytes, bytes);
-
 		ret = vc_do_mmio(ghcb, ctxt, bytes, false);
 		break;
-
-		/* MMIO Read */
-	case 0x8a:
-		bytes = 1;
-		fallthrough;
-	case 0x8b:
-		if (!bytes)
-			bytes = insn->opnd_bytes;
-
+	case MMIO_READ:
 		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
 		if (ret)
 			break;
 
-		reg_data = vc_insn_get_reg(ctxt);
-		if (!reg_data)
-			return ES_DECODE_FAILED;
-
 		/* Zero-extend for 32-bit operation */
 		if (bytes == 4)
 			*reg_data = 0;
 
 		memcpy(reg_data, ghcb->shared_buffer, bytes);
 		break;
+	case MMIO_READ_ZERO_EXTEND:
+		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+		if (ret)
+			break;
+
+		memset(reg_data, 0, insn->opnd_bytes);
+		memcpy(reg_data, ghcb->shared_buffer, bytes);
+		break;
+	case MMIO_READ_SIGN_EXTEND:
+		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+		if (ret)
+			break;
 
-		/* MOVS instruction */
-	case 0xa4:
-		bytes = 1;
-		fallthrough;
-	case 0xa5:
-		if (!bytes)
-			bytes = insn->opnd_bytes;
+		if (bytes == 1) {
+			u8 *val = (u8 *)ghcb->shared_buffer;
 
-		ret = vc_handle_mmio_movs(ctxt, bytes);
+			sign_byte = (*val & 0x80) ? 0xff : 0x00;
+		} else {
+			u16 *val = (u16 *)ghcb->shared_buffer;
+
+			sign_byte = (*val & 0x8000) ? 0xff : 0x00;
+		}
+
+		/* Sign extend based on operand size */
+		memset(reg_data, sign_byte, insn->opnd_bytes);
+		memcpy(reg_data, ghcb->shared_buffer, bytes);
 		break;
-		/* Two-Byte Opcodes */
-	case 0x0f:
-		ret = vc_handle_mmio_twobyte_ops(ghcb, ctxt);
+	case MMIO_MOVS:
+		ret = vc_handle_mmio_movs(ctxt, bytes);
 		break;
 	default:
 		ret = ES_UNSUPPORTED;
+		break;
 	}
 
 	return ret;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v3 4/4] x86/tdx: Handle in-kernel MMIO
  2021-06-08 15:59           ` [RFC v2-fix-v3 0/4] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
                               ` (2 preceding siblings ...)
  2021-06-08 15:59             ` [RFC v2-fix-v3 3/4] x86/sev-es: Use insn_decode_mmio() for MMIO implementation Kuppuswamy Sathyanarayanan
@ 2021-06-08 15:59             ` Kuppuswamy Sathyanarayanan
  3 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-08 15:59 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

In traditional VMs, MMIO is usually implemented by giving the
guest access to a mapping that causes a VMEXIT on access, and
then having the VMM emulate the access. That's not possible in
a TDX guest, because a VMEXIT would expose the register state to
the host. TDX guests don't trust the host and can't have their
state exposed to it. In TDX, MMIO regions are instead configured
to trigger a #VE exception in the guest. The guest #VE handler
then emulates the MMIO instruction inside the guest and converts
it into a controlled TDCALL to the host, rather than exposing
the state completely.

Currently, MMIO is only supported for instructions that are known
to come from the io.h macros (build_mmio_read/write()). Drivers
that don't use the io.h macros, or that use structure overlays to
do MMIO, are currently not supported in TDX guests (for example,
the MMIO-based XAPIC is disabled at runtime for TDX).

This way of handling MMIO is similar to AMD SEV.

The reasons for supporting #VE-based MMIO in TDX guests are:

* MMIO is widely used, and more drivers will use it in the future.
* We don't want to annotate every TDX-specific MMIO readl/writel etc.
* Without annotations we would need to add an alternative to every
  MMIO access in the kernel (even though 99.9% will never be used on
  TDX), which would be a complete waste and incredible binary bloat
  for nothing.
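
The sign-extension step of the read emulation can be modeled in
isolation (a userspace sketch assuming little-endian layout, as on
x86; the kernel code performs the same memset/memcpy sequence into
the pt_regs register slot):

```c
#include <stdint.h>
#include <string.h>

/* Model of the MMIO_READ_SIGN_EXTEND case: widen a 1- or 2-byte
 * value returned by the host into a full register image by filling
 * the operand-sized destination with the sign bit first, then
 * copying the low bytes of the value over it. */
static unsigned long mmio_sign_extend(unsigned long val, int size,
				      int opnd_bytes)
{
	unsigned long reg = 0;
	uint8_t sign_byte;

	if (size == 1)
		sign_byte = (val & 0x80) ? 0xff : 0x00;
	else
		sign_byte = (val & 0x8000) ? 0xff : 0x00;

	memset(&reg, sign_byte, opnd_bytes);
	memcpy(&reg, &val, size);
	return reg;
}
```

Zero extension is the same dance with sign_byte forced to 0x00, which
is why both cases share the memset/memcpy structure in the handler.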

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 109 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 48a0cc2663ea..053e69782e3d 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -5,6 +5,9 @@
 
 #include <asm/tdx.h>
 #include <asm/vmx.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
+#include <linux/sched/signal.h> /* force_sig_fault() */
 
 #include <linux/cpu.h>
 #include <linux/protected_guest.h>
@@ -226,6 +229,104 @@ static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
 	}
 }
 
+static unsigned long tdg_mmio(int size, bool write, unsigned long addr,
+		unsigned long *val)
+{
+	struct tdx_hypercall_output out = {0};
+	u64 err;
+
+	err = __tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
+			      addr, *val, &out);
+	*val = out.r11;
+	return err;
+}
+
+static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+	struct insn insn = {};
+	char buffer[MAX_INSN_SIZE];
+	enum mmio_type mmio;
+	unsigned long *reg;
+	int size, ret;
+	u8 sign_byte;
+	unsigned long val;
+
+	if (user_mode(regs)) {
+		ret = insn_fetch_from_user(regs, buffer);
+		if (!ret)
+			return -EFAULT;
+		if (!insn_decode_from_regs(&insn, regs, buffer, ret))
+			return -EFAULT;
+	} else {
+		ret = copy_from_kernel_nofault(buffer, (void *)regs->ip,
+					       MAX_INSN_SIZE);
+		if (ret)
+			return -EFAULT;
+		insn_init(&insn, buffer, MAX_INSN_SIZE, 1);
+		insn_get_length(&insn);
+	}
+
+	mmio = insn_decode_mmio(&insn, &size);
+	if (mmio == MMIO_DECODE_FAILED)
+		return -EFAULT;
+
+	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+		reg = insn_get_modrm_reg_ptr(&insn, regs);
+		if (!reg)
+			return -EFAULT;
+	}
+
+	switch (mmio) {
+	case MMIO_WRITE:
+		memcpy(&val, reg, size);
+		ret = tdg_mmio(size, true, ve->gpa, &val);
+		break;
+	case MMIO_WRITE_IMM:
+		val = insn.immediate.value;
+		ret = tdg_mmio(size, true, ve->gpa, &val);
+		break;
+	case MMIO_READ:
+		ret = tdg_mmio(size, false, ve->gpa, &val);
+		if (ret)
+			break;
+		/* Zero-extend for 32-bit operation */
+		if (size == 4)
+			*reg = 0;
+		memcpy(reg, &val, size);
+		break;
+	case MMIO_READ_ZERO_EXTEND:
+		ret = tdg_mmio(size, false, ve->gpa, &val);
+		if (ret)
+			break;
+
+		/* Zero extend based on operand size */
+		memset(reg, 0, insn.opnd_bytes);
+		memcpy(reg, &val, size);
+		break;
+	case MMIO_READ_SIGN_EXTEND:
+		ret = tdg_mmio(size, false, ve->gpa, &val);
+		if (ret)
+			break;
+
+		if (size == 1)
+			sign_byte = (val & 0x80) ? 0xff : 0x00;
+		else
+			sign_byte = (val & 0x8000) ? 0xff : 0x00;
+
+		/* Sign extend based on operand size */
+		memset(reg, sign_byte, insn.opnd_bytes);
+		memcpy(reg, &val, size);
+		break;
+	case MMIO_MOVS:
+	case MMIO_DECODE_FAILED:
+		return -EFAULT;
+	}
+
+	if (ret)
+		return -EFAULT;
+	return insn.length;
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -275,6 +376,14 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_IO_INSTRUCTION:
 		tdg_handle_io(regs, ve->exit_qual);
 		break;
+	case EXIT_REASON_EPT_VIOLATION:
+		/* Currently only MMIO triggers EPT violation */
+		ve->instr_len = tdg_handle_mmio(regs, ve);
+		if (ve->instr_len < 0) {
+			pr_warn_once("MMIO failed\n");
+			return -EFAULT;
+		}
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 3/3] x86/tdx: Handle port I/O
  2021-06-08 15:40                         ` [RFC v2-fix-v2 3/3] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
@ 2021-06-08 16:26                           ` Dan Williams
  0 siblings, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-06-08 16:26 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Tue, Jun 8, 2021 at 8:40 AM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> TDX hypervisors cannot emulate instructions directly. This
> includes port I/O, which is normally emulated in the hypervisor.
> All port I/O instructions inside a TDX guest trigger the #VE
> exception in the guest and would normally be emulated there.
>
> String I/O is also not supported in a TDX guest, so unroll the
> string I/O operations into a loop operating on one element at
> a time. This method is similar to AMD SEV, so just extend that
> support to the TDX guest platform.
>
> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> ---
> Changes since RFC v2-fix-v1:
>  * Fixed commit log to adapt to decompression support code split.

Looks good to me:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-04-26 18:01 ` [RFC v2 08/32] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
  2021-05-07 21:36   ` Dave Hansen
@ 2021-06-08 17:02   ` Dave Hansen
  2021-06-08 17:48     ` Sean Christopherson
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-06-08 17:02 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +DEFINE_IDTENTRY(exc_virtualization_exception)
> +{
> +	struct ve_info ve;
> +	int ret;
> +
> +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> +
> +	/*
> +	 * Consume #VE info before re-enabling interrupts. It will be
> +	 * re-enabled after executing the TDGETVEINFO TDCALL.
> +	 */
> +	ret = tdg_get_ve_info(&ve);

Is it safe to have *anything* before the tdg_get_ve_info()?  For
instance, say that RCU_LOCKDEP_WARN() triggers.  Will anything in there
do MMIO?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-06-08 17:02   ` [RFC v2 08/32] " Dave Hansen
@ 2021-06-08 17:48     ` Sean Christopherson
  2021-06-08 17:53       ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-06-08 17:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel

On Tue, Jun 08, 2021, Dave Hansen wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > +{
> > +	struct ve_info ve;
> > +	int ret;
> > +
> > +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> > +
> > +	/*
> > +	 * Consume #VE info before re-enabling interrupts. It will be
> > +	 * re-enabled after executing the TDGETVEINFO TDCALL.
> > +	 */
> > +	ret = tdg_get_ve_info(&ve);
> 
> Is it safe to have *anything* before the tdg_get_ve_info()?  For
> instance, say that RCU_LOCKDEP_WARN() triggers.  Will anything in there
> do MMIO?

I doubt it's safe, anything that's doing printing has the potential to trigger
#VE.  Even if we can prove it's safe for all possible paths, I can't think of a
reason to allow anything that's not absolutely necessary before retrieving the
#VE info.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-06-08 17:48     ` Sean Christopherson
@ 2021-06-08 17:53       ` Dave Hansen
  2021-06-08 18:12         ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-06-08 17:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel

On 6/8/21 10:48 AM, Sean Christopherson wrote:
> On Tue, Jun 08, 2021, Dave Hansen wrote:
>> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>>> +#ifdef CONFIG_INTEL_TDX_GUEST
>>> +DEFINE_IDTENTRY(exc_virtualization_exception)
>>> +{
>>> +	struct ve_info ve;
>>> +	int ret;
>>> +
>>> +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
>>> +
>>> +	/*
>>> +	 * Consume #VE info before re-enabling interrupts. It will be
>>> +	 * re-enabled after executing the TDGETVEINFO TDCALL.
>>> +	 */
>>> +	ret = tdg_get_ve_info(&ve);
>> Is it safe to have *anything* before the tdg_get_ve_info()?  For
>> instance, say that RCU_LOCKDEP_WARN() triggers.  Will anything in there
>> do MMIO?
> I doubt it's safe, anything that's doing printing has the potential to trigger
> #VE.  Even if we can prove it's safe for all possible paths, I can't think of a
> reason to allow anything that's not absolutely necessary before retrieving the
> #VE info.

What about tracing?  Can I plop a kprobe in here or turn on ftrace?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-06-08 17:53       ` Dave Hansen
@ 2021-06-08 18:12         ` Andi Kleen
  2021-06-08 18:15           ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-06-08 18:12 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel


On 6/8/2021 10:53 AM, Dave Hansen wrote:
> On 6/8/21 10:48 AM, Sean Christopherson wrote:
>> On Tue, Jun 08, 2021, Dave Hansen wrote:
>>> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>>>> +#ifdef CONFIG_INTEL_TDX_GUEST
>>>> +DEFINE_IDTENTRY(exc_virtualization_exception)
>>>> +{
>>>> +	struct ve_info ve;
>>>> +	int ret;
>>>> +
>>>> +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
>>>> +
>>>> +	/*
>>>> +	 * Consume #VE info before re-enabling interrupts. It will be
>>>> +	 * re-enabled after executing the TDGETVEINFO TDCALL.
>>>> +	 */
>>>> +	ret = tdg_get_ve_info(&ve);
>>> Is it safe to have *anything* before the tdg_get_ve_info()?  For
>>> instance, say that RCU_LOCKDEP_WARN() triggers.  Will anything in there
>>> do MMIO?
>> I doubt it's safe, anything that's doing printing has the potential to trigger
>> #VE.  Even if we can prove it's safe for all possible paths, I can't think of a
>> reason to allow anything that's not absolutely necessary before retrieving the
>> #VE info.
> What about tracing?  Can I plop a kprobe in here or turn on ftrace?

I believe neither does mmio/msr normally (except maybe ftrace+tp_printk, 
but that will likely work because it shouldn't recurse more than once 
due to ftrace's reentry protection)

-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-06-08 18:12         ` Andi Kleen
@ 2021-06-08 18:15           ` Dave Hansen
  2021-06-08 18:17             ` Andy Lutomirski
  2021-06-08 18:18             ` Andi Kleen
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-06-08 18:15 UTC (permalink / raw)
  To: Andi Kleen, Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel

On 6/8/21 11:12 AM, Andi Kleen wrote:
> I believe neither does mmio/msr normally (except maybe
> ftrace+tp_printk, but that will likely work because it shouldn't
> recurse more than once due to ftrace's reentry protection)

Can it do MMIO:

> +DEFINE_IDTENTRY(exc_virtualization_exception)
> +{
=======> HERE
> +    ret = tdg_get_ve_info(&ve); 

Recursion isn't the problem.  It would double-fault there, right?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-06-08 18:15           ` Dave Hansen
@ 2021-06-08 18:17             ` Andy Lutomirski
  2021-06-08 18:18             ` Andi Kleen
  1 sibling, 0 replies; 381+ messages in thread
From: Andy Lutomirski @ 2021-06-08 18:17 UTC (permalink / raw)
  To: Dave Hansen, Andi Kleen, Sean Christopherson
  Cc: Sathyanarayanan Kuppuswamy, Peter Zijlstra (Intel),
	Williams, Dan J, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Linux Kernel Mailing List



On Tue, Jun 8, 2021, at 11:15 AM, Dave Hansen wrote:
> On 6/8/21 11:12 AM, Andi Kleen wrote:
> > I believe neither does mmio/msr normally (except maybe
> > ftrace+tp_printk, but that will likely work because it shouldn't
> > recurse more than once due to ftrace's reentry protection)
> 
> Can it do MMIO:
> 
> > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > +{
> =======> HERE
> > +    ret = tdg_get_ve_info(&ve); 
> 
> Recursion isn't the problem.  It would double-fault there, right?
> 

We should do the get_ve_info in a noinstr region.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-06-08 18:15           ` Dave Hansen
  2021-06-08 18:17             ` Andy Lutomirski
@ 2021-06-08 18:18             ` Andi Kleen
  1 sibling, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-06-08 18:18 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel


On 6/8/2021 11:15 AM, Dave Hansen wrote:
> On 6/8/21 11:12 AM, Andi Kleen wrote:
>> I believe neither does mmio/msr normally (except maybe
>> ftrace+tp_printk, but that will likely work because it shouldn't
>> recurse more than once due to ftrace's reentry protection)
> Can it do MMIO:
>
>> +DEFINE_IDTENTRY(exc_virtualization_exception)
>> +{
> =======> HERE
>> +    ret = tdg_get_ve_info(&ve);
> Recursion isn't the problem.  It would double-fault there, right?

Yes that's right. tp_printk already has a lot of other corner cases that 
break though, so it's not a real issue.

-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v3 1/1] x86: Introduce generic protected guest abstraction
  2021-06-07 22:26                       ` Kuppuswamy, Sathyanarayanan
@ 2021-06-08 21:30                         ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-08 21:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Borislav Petkov
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kuppuswamy Sathyanarayanan

Add a generic way to check if we are running in an encrypted guest,
without requiring x86-specific #ifdefs. This can then be used in
non-architecture-specific code.

prot_guest_has() is used to check for protected guest feature
flags.

Originally-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix-v2:
 * Renamed protected_guest_has() to prot_guest_has().
 * Changed flag prefix from VM_ to PR_GUEST_
 * Merged Borislav AMD implementation fix.

 arch/x86/include/asm/sev.h      |  3 +++
 arch/x86/include/asm/tdx.h      |  7 ++++++
 arch/x86/kernel/sev.c           | 15 +++++++++++++
 arch/x86/kernel/tdx.c           | 15 +++++++++++++
 arch/x86/mm/mem_encrypt.c       |  1 +
 include/linux/protected_guest.h | 38 +++++++++++++++++++++++++++++++++
 6 files changed, 79 insertions(+)
 create mode 100644 include/linux/protected_guest.h

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index fa5cd05d3b5b..e9b0b93a3157 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -81,12 +81,15 @@ static __always_inline void sev_es_nmi_complete(void)
 		__sev_es_nmi_complete();
 }
 extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
+bool sev_protected_guest_has(unsigned long flag);
+
 #else
 static inline void sev_es_ist_enter(struct pt_regs *regs) { }
 static inline void sev_es_ist_exit(void) { }
 static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
 static inline void sev_es_nmi_complete(void) { }
 static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
+static inline bool sev_protected_guest_has(unsigned long flag) { return false; }
 #endif
 
 #endif
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f0c1912837c8..cbfe7479f2a3 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -71,6 +71,8 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
 		    struct tdx_hypercall_output *out);
 
+bool tdx_protected_guest_has(unsigned long flag);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
@@ -80,6 +82,11 @@ static inline bool is_tdx_guest(void)
 
 static inline void tdx_early_init(void) { };
 
+static inline bool tdx_protected_guest_has(unsigned long flag)
+{
+	return false;
+}
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 651b81cd648e..16e5c5f25e6f 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -19,6 +19,7 @@
 #include <linux/memblock.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/protected_guest.h>
 
 #include <asm/cpu_entry_area.h>
 #include <asm/stacktrace.h>
@@ -1493,3 +1494,17 @@ bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
 	while (true)
 		halt();
 }
+
+bool sev_protected_guest_has(unsigned long flag)
+{
+	switch (flag) {
+	case PR_GUEST_MEM_ENCRYPT:
+	case PR_GUEST_MEM_ENCRYPT_ACTIVE:
+	case PR_GUEST_UNROLL_STRING_IO:
+	case PR_GUEST_HOST_MEM_ENCRYPT:
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(sev_protected_guest_has);
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 17725646eb30..111f15c05e24 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,7 @@
 #include <asm/vmx.h>
 
 #include <linux/cpu.h>
+#include <linux/protected_guest.h>
 
 /* TDX Module call Leaf IDs */
 #define TDINFO				1
@@ -75,6 +76,20 @@ bool is_tdx_guest(void)
 }
 EXPORT_SYMBOL_GPL(is_tdx_guest);
 
+bool tdx_protected_guest_has(unsigned long flag)
+{
+	switch (flag) {
+	case PR_GUEST_MEM_ENCRYPT:
+	case PR_GUEST_MEM_ENCRYPT_ACTIVE:
+	case PR_GUEST_UNROLL_STRING_IO:
+	case PR_GUEST_SHARED_MAPPING_INIT:
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(tdx_protected_guest_has);
+
 static void tdg_get_info(void)
 {
 	u64 ret;
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index ff08dc463634..d0026bce47df 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -20,6 +20,7 @@
 #include <linux/bitops.h>
 #include <linux/dma-mapping.h>
 #include <linux/virtio_config.h>
+#include <linux/protected_guest.h>
 
 #include <asm/tlbflush.h>
 #include <asm/fixmap.h>
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
new file mode 100644
index 000000000000..adfa62e2615e
--- /dev/null
+++ b/include/linux/protected_guest.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_PROTECTED_GUEST_H
+#define _LINUX_PROTECTED_GUEST_H 1
+
+#include <asm/processor.h>
+#include <asm/tdx.h>
+#include <asm/sev.h>
+
+/* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */
+
+/* Support for guest encryption */
+#define PR_GUEST_MEM_ENCRYPT			0x100
+/* Encryption support is active */
+#define PR_GUEST_MEM_ENCRYPT_ACTIVE		0x101
+/* Support for unrolled string IO */
+#define PR_GUEST_UNROLL_STRING_IO		0x102
+/* Support for host memory encryption */
+#define PR_GUEST_HOST_MEM_ENCRYPT		0x103
+/* Support for shared mapping initialization (after early init) */
+#define PR_GUEST_SHARED_MAPPING_INIT		0x104
+
+#if defined(CONFIG_INTEL_TDX_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
+
+static inline bool prot_guest_has(unsigned long flag)
+{
+	if (is_tdx_guest())
+		return tdx_protected_guest_has(flag);
+	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+		return sev_protected_guest_has(flag);
+
+	return false;
+}
+
+#else
+static inline bool prot_guest_has(unsigned long flag) { return false; }
+#endif
+
+#endif /* _LINUX_PROTECTED_GUEST_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-05  3:35                                                 ` Dan Williams
@ 2021-06-08 21:35                                                   ` Kuppuswamy Sathyanarayanan
  2021-06-08 21:41                                                     ` Dan Williams
                                                                       ` (2 more replies)
  0 siblings, 3 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-08 21:35 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

The current TDX spec does not have support to emulate the WBINVD
instruction. So, add support to skip the WBINVD instruction in
drivers that are currently enabled in TDX guests.

Functionally only devices outside the CPU (such as DMA devices,
or persistent memory for flushing) can notice the external side
effects from WBINVD's cache flushing for write back mappings.
One exception here is MKTME, but that is not visible outside
the TDX module and not possible inside a TDX guest.

Currently TDX does not support DMA, because DMA typically needs
uncached access for MMIO, and the current TDX module always
sets the IgnorePAT bit, which prevents that.
   
Persistent memory is also currently not supported. Another code
path that uses WBINVD is the MTRR driver, but EPT/virtualization
always disables MTRRs so those are not needed. This all implies
WBINVD is not needed with current TDX.

So, most drivers/code paths that use the WBINVD instruction are
already disabled for TDX guest platforms via config options or the
BIOS. The following drivers use the WBINVD instruction and are still
enabled for TDX guests:
   
drivers/acpi/sleep.c
drivers/acpi/acpica/hwsleep.c
   
Since the cache is always coherent in TDX guests, making WBINVD a
no-op should not cause any issues. This behavior is the same as for
KVM guests.
   
Also, hwsleep shouldn't happen for a TDX guest because the TDX
BIOS won't enable it, but it's better to disable it anyway.
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix-v2:
 * Instead of handling WBINVD #VE exception as nop, we skip its
   usage in currently enabled drivers.
 * Adapted commit log for above change.

 arch/x86/kernel/tdx.c           |  1 +
 drivers/acpi/acpica/hwsleep.c   | 12 +++++++++---
 drivers/acpi/sleep.c            | 26 +++++++++++++++++++++++---
 include/linux/protected_guest.h |  2 ++
 4 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 1caf9fa5bb30..e33928131e6a 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -100,6 +100,7 @@ bool tdx_protected_guest_has(unsigned long flag)
 	case PR_GUEST_MEM_ENCRYPT_ACTIVE:
 	case PR_GUEST_UNROLL_STRING_IO:
 	case PR_GUEST_SHARED_MAPPING_INIT:
+	case PR_GUEST_DISABLE_WBINVD:
 		return true;
 	}
 
diff --git a/drivers/acpi/acpica/hwsleep.c b/drivers/acpi/acpica/hwsleep.c
index 14baa13bf848..9d40df1b8a74 100644
--- a/drivers/acpi/acpica/hwsleep.c
+++ b/drivers/acpi/acpica/hwsleep.c
@@ -9,6 +9,7 @@
  *****************************************************************************/
 
 #include <acpi/acpi.h>
+#include <linux/protected_guest.h>
 #include "accommon.h"
 
 #define _COMPONENT          ACPI_HARDWARE
@@ -108,9 +109,14 @@ acpi_status acpi_hw_legacy_sleep(u8 sleep_state)
 	pm1a_control |= sleep_enable_reg_info->access_bit_mask;
 	pm1b_control |= sleep_enable_reg_info->access_bit_mask;
 
-	/* Flush caches, as per ACPI specification */
-
-	ACPI_FLUSH_CPU_CACHE();
+	/*
+	 * WBINVD instruction is not supported in TDX
+	 * guest. Since ACPI_FLUSH_CPU_CACHE() uses
+	 * WBINVD, skip cache flushes for TDX guests.
+	 */
+	if (prot_guest_has(PR_GUEST_DISABLE_WBINVD))
+		/* Flush caches, as per ACPI specification */
+		ACPI_FLUSH_CPU_CACHE();
 
 	status = acpi_os_enter_sleep(sleep_state, pm1a_control, pm1b_control);
 	if (status == AE_CTRL_TERMINATE) {
diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index df386571da98..3d6c213481f0 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -18,6 +18,7 @@
 #include <linux/acpi.h>
 #include <linux/module.h>
 #include <linux/syscore_ops.h>
+#include <linux/protected_guest.h>
 #include <asm/io.h>
 #include <trace/events/power.h>
 
@@ -71,7 +72,14 @@ static int acpi_sleep_prepare(u32 acpi_state)
 		acpi_set_waking_vector(acpi_wakeup_address);
 
 	}
-	ACPI_FLUSH_CPU_CACHE();
+
+	/*
+	 * WBINVD instruction is not supported in TDX
+	 * guest. Since ACPI_FLUSH_CPU_CACHE() uses
+	 * WBINVD, skip cache flushes for TDX guests.
+	 */
+	if (prot_guest_has(PR_GUEST_DISABLE_WBINVD))
+		ACPI_FLUSH_CPU_CACHE();
 #endif
 	printk(KERN_INFO PREFIX "Preparing to enter system sleep state S%d\n",
 		acpi_state);
@@ -566,7 +574,13 @@ static int acpi_suspend_enter(suspend_state_t pm_state)
 	u32 acpi_state = acpi_target_sleep_state;
 	int error;
 
-	ACPI_FLUSH_CPU_CACHE();
+	/*
+	 * WBINVD instruction is not supported in TDX
+	 * guest. Since ACPI_FLUSH_CPU_CACHE() uses
+	 * WBINVD, skip cache flushes for TDX guests.
+	 */
+	if (prot_guest_has(PR_GUEST_DISABLE_WBINVD))
+		ACPI_FLUSH_CPU_CACHE();
 
 	trace_suspend_resume(TPS("acpi_suspend"), acpi_state, true);
 	switch (acpi_state) {
@@ -899,7 +913,13 @@ static int acpi_hibernation_enter(void)
 {
 	acpi_status status = AE_OK;
 
-	ACPI_FLUSH_CPU_CACHE();
+	/*
+	 * WBINVD instruction is not supported in TDX
+	 * guest. Since ACPI_FLUSH_CPU_CACHE() uses
+	 * WBINVD, skip cache flushes for TDX guests.
+	 */
+	if (prot_guest_has(PR_GUEST_DISABLE_WBINVD))
+		ACPI_FLUSH_CPU_CACHE();
 
 	/* This shouldn't return.  If it returns, we have a problem */
 	status = acpi_enter_sleep_state(ACPI_STATE_S4);
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
index adfa62e2615e..0ec4dab86f67 100644
--- a/include/linux/protected_guest.h
+++ b/include/linux/protected_guest.h
@@ -18,6 +18,8 @@
 #define PR_GUEST_HOST_MEM_ENCRYPT		0x103
 /* Support for shared mapping initialization (after early init) */
 #define PR_GUEST_SHARED_MAPPING_INIT		0x104
+/* Support to disable WBINVD */
+#define PR_GUEST_DISABLE_WBINVD			0x105
 
 #if defined(CONFIG_INTEL_TDX_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-08 21:35                                                   ` [RFC v2-fix-v3 1/1] x86/tdx: Skip " Kuppuswamy Sathyanarayanan
@ 2021-06-08 21:41                                                     ` Dan Williams
  2021-06-08 22:17                                                     ` Dave Hansen
  2021-06-08 23:32                                                     ` Dan Williams
  2 siblings, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-06-08 21:41 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Rafael J. Wysocki, Linux ACPI

[ add Rafael and linux-acpi ]

On Tue, Jun 8, 2021 at 2:35 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> The current TDX spec does not have support to emulate the WBINVD
> instruction. So, add support to skip the WBINVD instruction in
> drivers that are currently enabled in TDX guests.
>
> Functionally only devices outside the CPU (such as DMA devices,
> or persistent memory for flushing) can notice the external side
> effects from WBINVD's cache flushing for write back mappings.
> One exception here is MKTME, but that is not visible outside
> the TDX module and not possible inside a TDX guest.
>
> Currently TDX does not support DMA, because DMA typically needs
> uncached access for MMIO, and the current TDX module always
> sets the IgnorePAT bit, which prevents that.
>
> Persistent memory is also currently not supported. Another code
> path that uses WBINVD is the MTRR driver, but EPT/virtualization
> always disables MTRRs so those are not needed. This all implies
> WBINVD is not needed with current TDX.
>
> So, most drivers/code paths that use the WBINVD instruction are
> already disabled for TDX guest platforms via config options or the
> BIOS. The following drivers use the WBINVD instruction and are still
> enabled for TDX guests:
>
> drivers/acpi/sleep.c
> drivers/acpi/acpica/hwsleep.c
>
> Since the cache is always coherent in TDX guests, making WBINVD a
> no-op should not cause any issues. This behavior is the same as for
> KVM guests.
>
> Also, hwsleep shouldn't happen for a TDX guest because the TDX
> BIOS won't enable it, but it's better to disable it anyway.
>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>
> Changes since RFC v2-fix-v2:
>  * Instead of handling WBINVD #VE exception as nop, we skip its
>    usage in currently enabled drivers.
>  * Adapted commit log for above change.
>
>  arch/x86/kernel/tdx.c           |  1 +
>  drivers/acpi/acpica/hwsleep.c   | 12 +++++++++---
>  drivers/acpi/sleep.c            | 26 +++++++++++++++++++++++---
>  include/linux/protected_guest.h |  2 ++
>  4 files changed, 35 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 1caf9fa5bb30..e33928131e6a 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -100,6 +100,7 @@ bool tdx_protected_guest_has(unsigned long flag)
>         case PR_GUEST_MEM_ENCRYPT_ACTIVE:
>         case PR_GUEST_UNROLL_STRING_IO:
>         case PR_GUEST_SHARED_MAPPING_INIT:
> +       case PR_GUEST_DISABLE_WBINVD:
>                 return true;
>         }
>
> diff --git a/drivers/acpi/acpica/hwsleep.c b/drivers/acpi/acpica/hwsleep.c
> index 14baa13bf848..9d40df1b8a74 100644
> --- a/drivers/acpi/acpica/hwsleep.c
> +++ b/drivers/acpi/acpica/hwsleep.c
> @@ -9,6 +9,7 @@
>   *****************************************************************************/
>
>  #include <acpi/acpi.h>
> +#include <linux/protected_guest.h>
>  #include "accommon.h"
>
>  #define _COMPONENT          ACPI_HARDWARE
> @@ -108,9 +109,14 @@ acpi_status acpi_hw_legacy_sleep(u8 sleep_state)
>         pm1a_control |= sleep_enable_reg_info->access_bit_mask;
>         pm1b_control |= sleep_enable_reg_info->access_bit_mask;
>
> -       /* Flush caches, as per ACPI specification */
> -
> -       ACPI_FLUSH_CPU_CACHE();
> +       /*
> +        * WBINVD instruction is not supported in TDX
> +        * guest. Since ACPI_FLUSH_CPU_CACHE() uses
> +        * WBINVD, skip cache flushes for TDX guests.
> +        */
> +       if (prot_guest_has(PR_GUEST_DISABLE_WBINVD))
> +               /* Flush caches, as per ACPI specification */
> +               ACPI_FLUSH_CPU_CACHE();
>
>         status = acpi_os_enter_sleep(sleep_state, pm1a_control, pm1b_control);
>         if (status == AE_CTRL_TERMINATE) {
> diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
> index df386571da98..3d6c213481f0 100644
> --- a/drivers/acpi/sleep.c
> +++ b/drivers/acpi/sleep.c
> @@ -18,6 +18,7 @@
>  #include <linux/acpi.h>
>  #include <linux/module.h>
>  #include <linux/syscore_ops.h>
> +#include <linux/protected_guest.h>
>  #include <asm/io.h>
>  #include <trace/events/power.h>
>
> @@ -71,7 +72,14 @@ static int acpi_sleep_prepare(u32 acpi_state)
>                 acpi_set_waking_vector(acpi_wakeup_address);
>
>         }
> -       ACPI_FLUSH_CPU_CACHE();
> +
> +       /*
> +        * WBINVD instruction is not supported in TDX
> +        * guest. Since ACPI_FLUSH_CPU_CACHE() uses
> +        * WBINVD, skip cache flushes for TDX guests.
> +        */
> +       if (prot_guest_has(PR_GUEST_DISABLE_WBINVD))
> +               ACPI_FLUSH_CPU_CACHE();
>  #endif
>         printk(KERN_INFO PREFIX "Preparing to enter system sleep state S%d\n",
>                 acpi_state);
> @@ -566,7 +574,13 @@ static int acpi_suspend_enter(suspend_state_t pm_state)
>         u32 acpi_state = acpi_target_sleep_state;
>         int error;
>
> -       ACPI_FLUSH_CPU_CACHE();
> +       /*
> +        * WBINVD instruction is not supported in TDX
> +        * guest. Since ACPI_FLUSH_CPU_CACHE() uses
> +        * WBINVD, skip cache flushes for TDX guests.
> +        */
> +       if (prot_guest_has(PR_GUEST_DISABLE_WBINVD))
> +               ACPI_FLUSH_CPU_CACHE();
>
>         trace_suspend_resume(TPS("acpi_suspend"), acpi_state, true);
>         switch (acpi_state) {
> @@ -899,7 +913,13 @@ static int acpi_hibernation_enter(void)
>  {
>         acpi_status status = AE_OK;
>
> -       ACPI_FLUSH_CPU_CACHE();
> +       /*
> +        * WBINVD instruction is not supported in TDX
> +        * guest. Since ACPI_FLUSH_CPU_CACHE() uses
> +        * WBINVD, skip cache flushes for TDX guests.
> +        */
> +       if (prot_guest_has(PR_GUEST_DISABLE_WBINVD))
> +               ACPI_FLUSH_CPU_CACHE();
>
>         /* This shouldn't return.  If it returns, we have a problem */
>         status = acpi_enter_sleep_state(ACPI_STATE_S4);
> diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
> index adfa62e2615e..0ec4dab86f67 100644
> --- a/include/linux/protected_guest.h
> +++ b/include/linux/protected_guest.h
> @@ -18,6 +18,8 @@
>  #define PR_GUEST_HOST_MEM_ENCRYPT              0x103
>  /* Support for shared mapping initialization (after early init) */
>  #define PR_GUEST_SHARED_MAPPING_INIT           0x104
> +/* Support to disable WBINVD */
> +#define PR_GUEST_DISABLE_WBINVD                        0x105
>
>  #if defined(CONFIG_INTEL_TDX_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-08 21:35                                                   ` [RFC v2-fix-v3 1/1] x86/tdx: Skip " Kuppuswamy Sathyanarayanan
  2021-06-08 21:41                                                     ` Dan Williams
@ 2021-06-08 22:17                                                     ` Dave Hansen
  2021-06-08 22:34                                                       ` Andi Kleen
  2021-06-08 22:36                                                       ` Kuppuswamy, Sathyanarayanan
  2021-06-08 23:32                                                     ` Dan Williams
  2 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-06-08 22:17 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 6/8/21 2:35 PM, Kuppuswamy Sathyanarayanan wrote:
> Persistent memory is also currently not supported. Another code
> path that uses WBINVD is the MTRR driver, but EPT/virtualization
> always disables MTRRs so those are not needed. This all implies
> WBINVD is not needed with current TDX.

It's one thing to declare something unsupported.  It's quite another to
declare it unsupported and then back it up with code to ensure that any
attempted use is thwarted.

This patch certainly shows us half of the solution.  But, to be
complete, we also need to see the other half: where is the patch or
documentation for why it is not *possible* to encounter persistent
memory in a TDX guest?

BTW, "persistent memory" is much more than Intel Optane DCPMM.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-08 22:17                                                     ` Dave Hansen
@ 2021-06-08 22:34                                                       ` Andi Kleen
  2021-06-08 22:36                                                       ` Kuppuswamy, Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-06-08 22:34 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Dan Williams
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel


On 6/8/2021 3:17 PM, Dave Hansen wrote:
> On 6/8/21 2:35 PM, Kuppuswamy Sathyanarayanan wrote:
>> Persistent memory is also currently not supported. Another code
>> path that uses WBINVD is the MTRR driver, but EPT/virtualization
>> always disables MTRRs so those are not needed. This all implies
>> WBINVD is not needed with current TDX.
> It's one thing to declare something unsupported.  It's quite another to
> declare it unsupported and then back it up with code to ensure that any
> attempted use is thwarted.
>
> This patch certainly shows us half of the solution.  But, to be
> complete, we also need to see the other half: where is the patch


We had multiple patches to handle it earlier (by ignoring it, which is
the right way and is deployed successfully everywhere in KVM), but you
guys didn't like them.

So they got removed.

You can't have your cake and eat it. Either you have the ignore-or-warn
patches, or you have a panic.

In this iteration you have a panic (through the exception handler),
except that we explicitly ignore it for the cases we know can happen
(which is reboot).


> or
> documentation for why it is not *possible* to encounter persistent
> memory in a TDX guest?


I thought we already went over this ad nauseam.

The current TDX VMMs don't support anything else than plain DRAM.

If there is support for anything else in the future we'll need to add a 
new GHCI call that implements WBINVD through the host, but right now we 
don't need it.

-Andi.


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-08 22:17                                                     ` Dave Hansen
  2021-06-08 22:34                                                       ` Andi Kleen
@ 2021-06-08 22:36                                                       ` Kuppuswamy, Sathyanarayanan
  2021-06-08 22:53                                                         ` Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-08 22:36 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 6/8/21 3:17 PM, Dave Hansen wrote:
> On 6/8/21 2:35 PM, Kuppuswamy Sathyanarayanan wrote:
>> Persistent memory is also currently not supported. Another code
>> path that uses WBINVD is the MTRR driver, but EPT/virtualization
>> always disables MTRRs so those are not needed. This all implies
>> WBINVD is not needed with current TDX.
> 
> It's one thing to declare something unsupported.  It's quite another to
> declare it unsupported and then back it up with code to ensure that any
> attempted use is thwarted.

Only audited and supported drivers will be allowed to enumerate after
the device filter support patch is merged. Until we merge that patch, if
any of these unsupported features (with WBINVD usage) are enabled in TDX,
it will lead to a sigfault (due to an unhandled #VE).

In this patch we only create an exception for the ACPI sleep driver code.
If the commit log is confusing, I can remove the information about other
unsupported features (with WBINVD usage).

> 
> This patch certainly shows us half of the solution.  But, to be
> complete, we also need to see the other half: where is the patch or
> documentation for why it is not *possible* to encounter persistent
> memory in a TDX guest?
> 
> BTW, "persistent memory" is much more than Intel Optane DCPMM.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-08 22:36                                                       ` Kuppuswamy, Sathyanarayanan
@ 2021-06-08 22:53                                                         ` Dave Hansen
  2021-06-08 23:04                                                           ` Andi Kleen
  2021-06-08 23:04                                                           ` Kuppuswamy, Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-06-08 22:53 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 6/8/21 3:36 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 6/8/21 3:17 PM, Dave Hansen wrote:
>> On 6/8/21 2:35 PM, Kuppuswamy Sathyanarayanan wrote:
>>> Persistent memory is also currently not supported. Another code
>>> path that uses WBINVD is the MTRR driver, but EPT/virtualization
>>> always disables MTRRs so those are not needed. This all implies
>>> WBINVD is not needed with current TDX.
>>
>> It's one thing to declare something unsupported.  It's quite another to
>> declare it unsupported and then back it up with code to ensure that any
>> attempted use is thwarted.
> 
> Only audited and supported drivers will be allowed to enumerate after
> the device filter support patch is merged. Until we merge that patch, if
> any of these unsupported features (with WBINVD usage) are enabled in TDX,
> it will lead to a sigfault (due to an unhandled #VE).

A kernel driver using WBINVD will "sigfault"?  I'm not sure what that
means.  How does the kernel "sigfault"?

> In this patch we only create an exception for the ACPI sleep driver code.
> If the commit log is confusing, I can remove the information about other
> unsupported features (with WBINVD usage).

Yes, the changelog is horribly confusing.  But simply removing this
information is insufficient to rectify the deficiency.

I've lost trust that due diligence will be performed on this series on
its own.  I've seen too many broken promises and too many holes.

Here's what I want to see: a list of all of the unique call sites for
WBINVD in the kernel.  I want a written down methodology for how the
list of call sites was generated.  I want to see an item-by-item list of
why those call sites are unreachable with the TDX guest code.  It might
be because they've been patched in this patch, or the driver has been
disabled, or because the TDX architecture spec would somehow prohibit
the situation where it might be needed.  But, there needs to be a list,
and you have to show your work.  If you refer to code from this series
as helping to prevent WBINVD, then it has to be earlier in this series,
not in some other series and not later in this series.

Just eyeballing it, there are ~50 places in the kernel that need auditing.

Right now, we mostly have indiscriminate hand-waving about this not
being a problem.  It's a hard NAK from me on this patch until this audit
is in place.
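[Editor's note: the enumeration step of the audit demanded above can be
bootstrapped mechanically. Below is a minimal sketch, not from this
thread — the file-extension filter and output format are illustrative —
that walks a source tree and lists every potential wbinvd call site for
manual classification (patched, driver disabled, or architecturally
unreachable):]

```python
import os
import re
import sys

# Heuristic match for wbinvd() calls and the raw instruction in inline
# asm; every hit still needs a human to classify why it is unreachable
# in a TDX guest.
PATTERN = re.compile(r'\bwbinvd\b', re.IGNORECASE)

def find_wbinvd_sites(root):
    """Return (path, line_number, line_text) for each match under root."""
    sites = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            # Only scan C sources, headers, and assembly files.
            if not name.endswith(('.c', '.h', '.S')):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors='ignore') as f:
                for lineno, line in enumerate(f, 1):
                    if PATTERN.search(line):
                        sites.append((path, lineno, line.strip()))
    return sites

if __name__ == '__main__' and len(sys.argv) > 1:
    for path, lineno, text in find_wbinvd_sites(sys.argv[1]):
        print(f'{path}:{lineno}: {text}')
```

Run against a kernel tree (e.g. `python3 audit_wbinvd.py .`), this
produces the raw call-site list; the written methodology and the
item-by-item justification still have to be supplied by hand.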

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-08 22:53                                                         ` Dave Hansen
@ 2021-06-08 23:04                                                           ` Andi Kleen
  2021-06-08 23:04                                                           ` Kuppuswamy, Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-06-08 23:04 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy, Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Dan Williams
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel

> A kernel driver using WBINVD will "sigfault"?  I'm not sure what that
> means.  How does the kernel "sigfault"?

It panics. Please, you know exactly what Sathya meant because you've 
read the code.

>
> Here's what I want to see: a list of all of the unique call sites for
> WBINVD in the kernel.  I want a written down methodology for how the
> list of call sites was generated.  I want to see an item-by-item list of
> why those call sites are unreachable with the TDX guest code.  It might
> be because they've been patched in this patch, or the driver has been
> disabled, or because the TDX architecture spec would somehow prohibit
> the situation where it might be needed.  But, there needs to be a list,
> and you have to show your work.  If you refer to code from this series
> as helping to prevent WBINVD, then it has to be earlier in this series,
> not in some other series and not later in this series.

Sorry, this is ridiculous. We're not in a make-work project here. We're 
about practical engineering, not making our life artificially complicated.

If that is what is required, then the change requests to NOT ignore but 
patch every site were just not practical.

>
> Just eyeballing it, there are ~50 places in the kernel that need auditing.
>
> Right now, we mostly have indiscriminate hand-waving about this not
> being a problem.  It's a hard NAK from me on this patch until this audit
> is in place.


Okay, then we just go back to ignoring it, like the rest of the KVM world.

That's what we had originally, and it's fine because it's exactly what 
KVM does, which is all we want.

It was the sane thing to do, and it's still the sane thing to do, 
because it has always been done this way.

-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-08 22:53                                                         ` Dave Hansen
  2021-06-08 23:04                                                           ` Andi Kleen
@ 2021-06-08 23:04                                                           ` Kuppuswamy, Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-08 23:04 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 6/8/21 3:53 PM, Dave Hansen wrote:
> On 6/8/21 3:36 PM, Kuppuswamy, Sathyanarayanan wrote:
>> On 6/8/21 3:17 PM, Dave Hansen wrote:
>>> On 6/8/21 2:35 PM, Kuppuswamy Sathyanarayanan wrote:

> 
> A kernel driver using WBINVD will "sigfault"?  I'm not sure what that
> means.  How does the kernel "sigfault"?

Sorry, an unsupported #VE is handled similarly to a #GP fault.

> 
>> In this patch we only create an exception for the ACPI sleep driver code.
>> If the commit log is confusing, I can remove the information about other
>> unsupported features (with WBINVD usage).
> 
> Yes, the changelog is horribly confusing.  But simply removing this
> information is insufficient to rectify the deficiency.

I will remove all the unrelated information from this commit log. As long
as the commit log *only* talks about and handles the exception for the ACPI
sleep driver, it should be acceptable to you, right? I will also add a note
that if any other feature with WBINVD usage is enabled, it would lead to a
#GP fault.

> 
> I've lost trust that due diligence will be performed on this series on
> its own.  I've seen too many broken promises and too many holes.
> 
> Here's what I want to see: a list of all of the unique call sites for
> WBINVD in the kernel.  I want a written down methodology for how the
> list of call sites was generated.  I want to see an item-by-item list of
> why those call sites are unreachable with the TDX guest code.  It might
> be because they've been patched in this patch, or the driver has been
> disabled, or because the TDX architecture spec would somehow prohibit
> the situation where it might be needed.  But, there needs to be a list,
> and you have to show your work.  If you refer to code from this series
> as helping to prevent WBINVD, then it has to be earlier in this series,
> not in some other series and not later in this series.
> 
> Just eyeballing it, there are ~50 places in the kernel that need auditing.
> 
> Right now, we mostly have indiscriminate hand-waving about this not
> being a problem.  It's a hard NAK from me on this patch until this audit
> is in place.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/3] x86/tdx: Handle port I/O in decompression code
  2021-06-08 15:40                         ` [RFC v2-fix-v2 1/3] x86/tdx: Handle port I/O in decompression code Kuppuswamy Sathyanarayanan
@ 2021-06-08 23:12                           ` Dan Williams
  0 siblings, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-06-08 23:12 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Tue, Jun 8, 2021 at 8:40 AM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> Add support to replace in/out instructions in
> decompression code with TDX IO hypercalls.
>
> TDX cannot do port I/O directly. The TDX module triggers
> a #VE exception to let the guest kernel emulate port
> I/O by converting the accesses into TDX hypercalls to
> call the host.
>
> But for the really early code in the decompressor, #VE
> cannot be used because the IDT needed for handling the
> exception is not set up, and some other infrastructure
> needed by the handler is missing. So to support port I/O
> in the decompressor code, directly replace in/out
> instructions with TDX I/O hypercalls. This can be easily
> achieved by modifying the __in/__out macros.
>
> Also, since the TDX I/O hypercall requires an I/O size parameter,
> modify the __in/__out macros to accept the size as an input parameter.

Looks good to me:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/boot: Avoid #VE during boot for TDX platforms
  2021-05-27 21:25                     ` [RFC v2-fix-v4 " Kuppuswamy Sathyanarayanan
@ 2021-06-08 23:14                       ` Dan Williams
  0 siblings, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-06-08 23:14 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Thu, May 27, 2021 at 2:25 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> There are a few MSRs and control register bits which the kernel
> normally needs to modify during boot. But, TDX disallows
> modification of these registers to help provide consistent
> security guarantees. Fortunately, TDX ensures that these are all
> in the correct state before the kernel loads, which means the
> kernel has no need to modify them.
>
> The conditions to avoid are:
>
>   * Any writes to the EFER MSR
>   * Clearing CR0.NE
>   * Clearing CR4.MCE
>
> This theoretically makes guest boot more fragile. If, for
> instance, EFER was set up incorrectly and a WRMSR was performed,
> it will trigger an early exception panic, or a triple fault if it
> happens before early exceptions are set up. However, this is
> likely to trip up the guest BIOS long before control reaches the
> kernel. In any case, these kinds of problems are unlikely to occur
> in production environments, and developers have good debug
> tools to fix them quickly.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Looks good to me:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-08 21:35                                                   ` [RFC v2-fix-v3 1/1] x86/tdx: Skip " Kuppuswamy Sathyanarayanan
  2021-06-08 21:41                                                     ` Dan Williams
  2021-06-08 22:17                                                     ` Dave Hansen
@ 2021-06-08 23:32                                                     ` Dan Williams
  2021-06-08 23:38                                                       ` Dave Hansen
  2 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-06-08 23:32 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Rafael J. Wysocki, Linux ACPI

On Tue, Jun 8, 2021 at 2:35 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> The current TDX spec does not have support to emulate the WBINVD
> instruction. So, add support to skip the WBINVD instruction in
> drivers that are currently enabled in the TDX guest.
>
> Functionally only devices outside the CPU (such as DMA devices,
> or persistent memory for flushing) can notice the external side
> effects from WBINVD's cache flushing for write back mappings.
> One exception here is MKTME, but that is not visible outside
> the TDX module and not possible inside a TDX guest.
>
> Currently TDX does not support DMA, because DMA typically needs
> uncached access for MMIO, and the current TDX module always
> sets the IgnorePAT bit, which prevents that.
>
> Persistent memory is also currently not supported. Another code
> path that uses WBINVD is the MTRR driver, but EPT/virtualization
> always disables MTRRs so those are not needed. This all implies
> WBINVD is not needed with current TDX.

Let's drop the last three paragraphs and just say something like:
"This is one of a series of patches to usages of wbinvd for protected
guests. For now this just addresses the one known path that TDX
executes, ACPI reboot. Its usage can be elided because FOO reason and
all the other ACPI_FLUSH_CPU_CACHE usages can be elided because BAR
reason"

>
> So, most drivers/code-paths that use wbinvd instructions are
> already disabled for TDX guest platforms via config options/BIOS.
> The following is the list of drivers that use the wbinvd
> instruction and are still enabled for TDX guests.
>
> drivers/acpi/sleep.c
> drivers/acpi/acpica/hwsleep.c
>
> Since the cache is always coherent in TDX guests, making wbinvd a
> noop should not cause any issues. This behavior is the same as for
> a KVM guest.
>
> Also, hwsleep shouldn't happen for a TDX guest because the TDX
> BIOS won't enable it, but it's better to disable it anyway.
>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>
> Changes since RFC v2-fix-v2:
>  * Instead of handling WBINVD #VE exception as nop, we skip its
>    usage in currently enabled drivers.
>  * Adapted commit log for above change.
>
>  arch/x86/kernel/tdx.c           |  1 +
>  drivers/acpi/acpica/hwsleep.c   | 12 +++++++++---
>  drivers/acpi/sleep.c            | 26 +++++++++++++++++++++++---
>  include/linux/protected_guest.h |  2 ++
>  4 files changed, 35 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 1caf9fa5bb30..e33928131e6a 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -100,6 +100,7 @@ bool tdx_protected_guest_has(unsigned long flag)
>         case PR_GUEST_MEM_ENCRYPT_ACTIVE:
>         case PR_GUEST_UNROLL_STRING_IO:
>         case PR_GUEST_SHARED_MAPPING_INIT:
> +       case PR_GUEST_DISABLE_WBINVD:
>                 return true;
>         }
>
> diff --git a/drivers/acpi/acpica/hwsleep.c b/drivers/acpi/acpica/hwsleep.c
> index 14baa13bf848..9d40df1b8a74 100644
> --- a/drivers/acpi/acpica/hwsleep.c
> +++ b/drivers/acpi/acpica/hwsleep.c
> @@ -9,6 +9,7 @@
>   *****************************************************************************/
>
>  #include <acpi/acpi.h>
> +#include <linux/protected_guest.h>
>  #include "accommon.h"
>
>  #define _COMPONENT          ACPI_HARDWARE
> @@ -108,9 +109,14 @@ acpi_status acpi_hw_legacy_sleep(u8 sleep_state)
>         pm1a_control |= sleep_enable_reg_info->access_bit_mask;
>         pm1b_control |= sleep_enable_reg_info->access_bit_mask;
>
> -       /* Flush caches, as per ACPI specification */
> -
> -       ACPI_FLUSH_CPU_CACHE();
> +       /*
> +        * WBINVD instruction is not supported in TDX
> +        * guest. Since ACPI_FLUSH_CPU_CACHE() uses
> +        * WBINVD, skip cache flushes for TDX guests.
> +        */
> +       if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD))
> +               /* Flush caches, as per ACPI specification */
> +               ACPI_FLUSH_CPU_CACHE();

ACPICA uses OS abstractions like ACPI_FLUSH_CPU_CACHE and Linux
patches rarely (never?) change ACPICA directly. If you want to change
ACPICA it goes through the ACPICA project first and is then
"Linux-ized", but in this case I believe you do not need to go that
path. Instead, this wants to change the definition of
ACPI_FLUSH_CPU_CACHE() directly in arch/x86/include/asm/acenv.h and
explain why the other ACPI cache flushing paths / requirements do not
apply to TDX guests.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-08 23:32                                                     ` Dan Williams
@ 2021-06-08 23:38                                                       ` Dave Hansen
  2021-06-09  0:07                                                         ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-06-08 23:38 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List,
	Rafael J. Wysocki, Linux ACPI

On 6/8/21 4:32 PM, Dan Williams wrote:
>> Persistent memory is also currently not supported. Another code
>> path that uses WBINVD is the MTRR driver, but EPT/virtualization
>> always disables MTRRs so those are not needed. This all implies
>> WBINVD is not needed with current TDX.
> Let's drop the last three paragraphs and just say something like:
> "This is one of a series of patches to usages of wbinvd for protected
> guests. For now this just addresses the one known path that TDX
> executes, ACPI reboot. Its usage can be elided because FOO reason and
> all the other ACPI_FLUSH_CPU_CACHE usages can be elided because BAR
> reason"

A better effort at transparency can be made here:

	This patches the one WBINVD instance which has been encountered
	in practice: ACPI reboot.  Assume no other instance will be
	encountered.


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-08 23:38                                                       ` Dave Hansen
@ 2021-06-09  0:07                                                         ` Dan Williams
  2021-06-09  0:14                                                           ` Kuppuswamy, Sathyanarayanan
  2021-06-09  1:10                                                           ` [RFC v2-fix-v4 " Kuppuswamy Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dan Williams @ 2021-06-09  0:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List, Rafael J. Wysocki, Linux ACPI

On Tue, Jun 8, 2021 at 4:38 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/8/21 4:32 PM, Dan Williams wrote:
> >> Persistent memory is also currently not supported. Another code
> >> path that uses WBINVD is the MTRR driver, but EPT/virtualization
> >> always disables MTRRs so those are not needed. This all implies
> >> WBINVD is not needed with current TDX.
> > Let's drop the last three paragraphs and just say something like:
> > "This is one of a series of patches to usages of wbinvd for protected
> > guests. For now this just addresses the one known path that TDX
> > executes, ACPI reboot. Its usage can be elided because FOO reason and
> > all the other ACPI_FLUSH_CPU_CACHE usages can be elided because BAR
> > reason"
>
> A better effort at transparency can be made here:
>
>         This patches the one WBINVD instance which has been encountered
>         in practice: ACPI reboot.  Assume no other instance will be
>         encountered.
>

That works too, but I assume that if ACPI_FLUSH_CPU_CACHE() itself is
going to be changed, rather than sprinkling protected_guest_has() checks
in a few places, it will need to assert why changing all of those at once
is correct. Otherwise I expect Rafael to ask why this global change of
the ACPI_FLUSH_CPU_CACHE() policy is ok.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v3 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  0:07                                                         ` Dan Williams
@ 2021-06-09  0:14                                                           ` Kuppuswamy, Sathyanarayanan
  2021-06-09  1:10                                                           ` [RFC v2-fix-v4 " Kuppuswamy Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-09  0:14 UTC (permalink / raw)
  To: Dan Williams, Dave Hansen
  Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List,
	Rafael J. Wysocki, Linux ACPI



On 6/8/21 5:07 PM, Dan Williams wrote:
> That works too, but I assume if ACPI_FLUSH_CPU_CACHE() itself is going
> to be changed rather than sprinkling protected_guest_has() checks in a
> few places it will need to assert why changing all of those at once is
> correct. Otherwise I expect Rafael to ask why this global change of
> the ACPI_FLUSH_CPU_CACHE() policy is ok.

Yes. I am fixing it as below.

--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -10,10 +10,15 @@
  #define _ASM_X86_ACENV_H

  #include <asm/special_insns.h>
+#include <asm/protected_guest.h>

  /* Asm macros */

-#define ACPI_FLUSH_CPU_CACHE() wbinvd()
+#define ACPI_FLUSH_CPU_CACHE()                         \
+do {                                                   \
+       if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD))   \
+               wbinvd();                               \
+} while (0)


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  0:07                                                         ` Dan Williams
  2021-06-09  0:14                                                           ` Kuppuswamy, Sathyanarayanan
@ 2021-06-09  1:10                                                           ` Kuppuswamy Sathyanarayanan
  2021-06-09  3:40                                                             ` Dan Williams
  2021-06-09 14:12                                                             ` Dave Hansen
  1 sibling, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-09  1:10 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

The current TDX spec does not have support to emulate the WBINVD
instruction. If any feature that uses WBINVD is enabled/used
in a TDX guest, it will lead to an unhandled #VE exception, which
will be treated as a #GP fault.

ACPI drivers also use the WBINVD instruction for cache flushes in
the reboot and shutdown code paths. Since the TDX guest has a
requirement to support the shutdown feature, skip WBINVD instruction
usage in ACPI drivers for TDX guests.

Since the cache is always coherent in TDX guests, making wbinvd a
noop should not cause any issues in the above-mentioned code paths.
The end-behavior is the same as for a KVM guest (treat it as a noop).

In the future, once the TDX guest specification adds support for a
WBINVD hypercall, we can pass the handling on to KVM.
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix-v3:
 * Fixed commit log as per review comments.
 * Instead of fixing all usages of ACPI_FLUSH_CPU_CACHE(),
   created TDX specific exception for it in its implementation.

Changes since RFC v2-fix-v2:
 * Instead of handling WBINVD #VE exception as nop, we skip its
   usage in currently enabled drivers.
 * Adapted commit log for above change.

 arch/x86/include/asm/acenv.h    | 7 ++++++-
 arch/x86/kernel/tdx.c           | 1 +
 include/linux/protected_guest.h | 2 ++
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/acenv.h b/arch/x86/include/asm/acenv.h
index 9aff97f0de7f..36c87b69366b 100644
--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -10,10 +10,15 @@
 #define _ASM_X86_ACENV_H
 
 #include <asm/special_insns.h>
+#include <linux/protected_guest.h>
 
 /* Asm macros */
 
-#define ACPI_FLUSH_CPU_CACHE()	wbinvd()
+#define ACPI_FLUSH_CPU_CACHE()				\
+do {							\
+	if (!prot_guest_has(PR_GUEST_DISABLE_WBINVD))	\
+		wbinvd();				\
+} while (0)
 
 int __acpi_acquire_global_lock(unsigned int *lock);
 int __acpi_release_global_lock(unsigned int *lock);
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 06fcbca402cb..fd27cf651f0b 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -92,6 +92,7 @@ bool tdx_protected_guest_has(unsigned long flag)
 	case PR_GUEST_MEM_ENCRYPT_ACTIVE:
 	case PR_GUEST_UNROLL_STRING_IO:
 	case PR_GUEST_SHARED_MAPPING_INIT:
+	case PR_GUEST_DISABLE_WBINVD:
 		return true;
 	}
 
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
index adfa62e2615e..0ec4dab86f67 100644
--- a/include/linux/protected_guest.h
+++ b/include/linux/protected_guest.h
@@ -18,6 +18,8 @@
 #define PR_GUEST_HOST_MEM_ENCRYPT		0x103
 /* Support for shared mapping initialization (after early init) */
 #define PR_GUEST_SHARED_MAPPING_INIT		0x104
+/* Support to disable WBINVD */
+#define PR_GUEST_DISABLE_WBINVD			0x105
 
 #if defined(CONFIG_INTEL_TDX_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  1:10                                                           ` [RFC v2-fix-v4 " Kuppuswamy Sathyanarayanan
@ 2021-06-09  3:40                                                             ` Dan Williams
  2021-06-09  3:56                                                               ` Kuppuswamy, Sathyanarayanan
  2021-06-09  4:02                                                               ` [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest Andy Lutomirski
  2021-06-09 14:12                                                             ` Dave Hansen
  1 sibling, 2 replies; 381+ messages in thread
From: Dan Williams @ 2021-06-09  3:40 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Tue, Jun 8, 2021 at 6:10 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> The current TDX spec does not have support to emulate the WBINVD
> instruction. If any feature that uses WBINVD is enabled/used
> in a TDX guest, it will lead to an unhandled #VE exception, which
> will be treated as a #GP fault.
>
> ACPI drivers also use the WBINVD instruction for cache flushes in
> the reboot and shutdown code paths. Since the TDX guest has a
> requirement to support the shutdown feature, skip WBINVD instruction
> usage in ACPI drivers for TDX guests.

This sounds awkward...

> Since the cache is always coherent in TDX guests, making wbinvd a

This is incorrect, ACPI cache flushing is not about I/O or CPU coherency...

> noop should not cause any issues in the above-mentioned code paths.

..."should" is a famous last word...

> The end-behavior is the same as for a KVM guest (treat it as a noop).

..."KVM gets away with it" is not a justification that TDX can stand
on otherwise we would not be here fixing up ACPICA properly.

How about:

"TDX guests use standard ACPI mechanisms to signal sleep state entry
(including reboot) to the host. The ACPI specification mandates WBINVD
on any sleep state entry with the expectation that the platform is
only responsible for maintaining the state of memory over sleep
states, not preserving dirty data in any CPU caches. ACPI cache
flushing requirements pre-date the advent of virtualization. Given TDX
guest sleep state entry does not affect any host power rails it is not
required to flush caches. The host is responsible for maintaining
cache state over its own bare metal sleep state transitions that
power-off the cache. If the host fails to manage caches over its sleep
state transitions the guest..."

I don't know how to finish the last sentence. What does TDX do if it
is resumed after host suspend and the host somehow arranged for dirty
TDX lines to be lost. Will that be noticed by TDX integrity
mechanisms? I did not immediately find an answer to this with a brief
look at the specs.

>
> In future, once TDX guest specification adds support for WBINVD
> hypercall, we can pass the handle to KVM to handle it.

I expect if the specification wanted operating systems to plan for
this eventuality it would have made a note of it. I expect this
sentence can just be deleted.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  3:40                                                             ` Dan Williams
@ 2021-06-09  3:56                                                               ` Kuppuswamy, Sathyanarayanan
  2021-06-09  4:19                                                                 ` Dan Williams
  2021-06-09  4:02                                                               ` [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest Andy Lutomirski
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-09  3:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List



On 6/8/21 8:40 PM, Dan Williams wrote:
> ..."KVM gets away with it" is not a justification that TDX can stand
> on otherwise we would not be here fixing up ACPICA properly.
> 
> How about:
> 
> "TDX guests use standard ACPI mechanisms to signal sleep state entry
> (including reboot) to the host. The ACPI specification mandates WBINVD
> on any sleep state entry with the expectation that the platform is
> only responsible for maintaining the state of memory over sleep
> states, not preserving dirty data in any CPU caches. ACPI cache
> flushing requirements pre-date the advent of virtualization. Given TDX
> guest sleep state entry does not affect any host power rails it is not
> required to flush caches. The host is responsible for maintaining
> cache state over its own bare metal sleep state transitions that
> power-off the cache. If the host fails to manage caches over its sleep
> state transitions the guest..."

> 
> I don't know how to finish the last sentence. What does TDX do if it
> is resumed after host suspend and the host somehow arranged for dirty
> TDX lines to be lost.

TDX guest does not support S3. It will be disabled in ACPI tables. It
is a TDX firmware spec requirement. Please check the following spec,
sec 2.1

https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf

In TDX guest, we encounter cache flushes only in shutdown and reboot path.
So there is no resume path.


> Will that be noticed by TDX integrity
> mechanisms? I did not immediately find an answer to this with a brief
> look at the specs.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  3:40                                                             ` Dan Williams
  2021-06-09  3:56                                                               ` Kuppuswamy, Sathyanarayanan
@ 2021-06-09  4:02                                                               ` Andy Lutomirski
  2021-06-09  4:21                                                                 ` Dan Williams
  2021-06-09  4:25                                                                 ` Andi Kleen
  1 sibling, 2 replies; 381+ messages in thread
From: Andy Lutomirski @ 2021-06-09  4:02 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Dave Hansen, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

On 6/8/21 8:40 PM, Dan Williams wrote:
> On Tue, Jun 8, 2021 at 6:10 PM Kuppuswamy Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>>
>> Current TDX spec does not have support to emulate the WBINVD
>> instruction. If any feature that uses WBINVD is enabled/used
>> in TDX guest, it will lead to un-handled #VE exception, which
>> will be handled as #GP fault.
>>
>> ACPI drivers also uses WBINVD instruction for cache flushes in
>> reboot or shutdown code path. Since TDX guest has requirement
>> to support shutdown feature, skip WBINVD instruction usage
>> in ACPI drivers for TDX guest.
> 
> This sounds awkward...
> 
>> Since cache is always coherent in TDX guests, making wbinvd as
> 
> This is incorrect, ACPI cache flushing is not about I/O or CPU coherency...
> 
>> noop should not cause any issues in above mentioned code path.
> 
> ..."should" is a famous last word...
> 
>> The end-behavior is the same as KVM guest (treat as noops).
> 
> ..."KVM gets away with it" is not a justification that TDX can stand
> on otherwise we would not be here fixing up ACPICA properly.
> 
> How about:
> 
> "TDX guests use standard ACPI mechanisms to signal sleep state entry
> (including reboot) to the host. The ACPI specification mandates WBINVD
> on any sleep state entry with the expectation that the platform is
> only responsible for maintaining the state of memory over sleep
> states, not preserving dirty data in any CPU caches. ACPI cache
> flushing requirements pre-date the advent of virtualization. Given TDX
> guest sleep state entry does not affect any host power rails it is not
> required to flush caches. The host is responsible for maintaining
> cache state over its own bare metal sleep state transitions that
> power-off the cache. If the host fails to manage caches over its sleep
> state transitions the guest..."
> 

I like this description, but shouldn't the logic be:

if (!CPUID has hypervisor bit set)
  wbinvd();

As far as I know, most hypervisors will turn WBINVD into a noop and,
even if they don't, it seems to me that something must be really quite
wrong for a guest to need to WBINVD for ACPI purposes.

-Andy

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  3:56                                                               ` Kuppuswamy, Sathyanarayanan
@ 2021-06-09  4:19                                                                 ` Dan Williams
  2021-06-09  4:27                                                                   ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-06-09  4:19 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Tue, Jun 8, 2021 at 8:56 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
>
>
> On 6/8/21 8:40 PM, Dan Williams wrote:
> > ..."KVM gets away with it" is not a justification that TDX can stand
> > on otherwise we would not be here fixing up ACPICA properly.
> >
> > How about:
> >
> > "TDX guests use standard ACPI mechanisms to signal sleep state entry
> > (including reboot) to the host. The ACPI specification mandates WBINVD
> > on any sleep state entry with the expectation that the platform is
> > only responsible for maintaining the state of memory over sleep
> > states, not preserving dirty data in any CPU caches. ACPI cache
> > flushing requirements pre-date the advent of virtualization. Given TDX
> > guest sleep state entry does not affect any host power rails it is not
> > required to flush caches. The host is responsible for maintaining
> > cache state over its own bare metal sleep state transitions that
> > power-off the cache. If the host fails to manage caches over its sleep
> > state transitions the guest..."
>
> >
> > I don't know how to finish the last sentence. What does TDX do if it
> > is resumed after host suspend and the host somehow arranged for dirty
> > TDX lines to be lost.
>
> TDX guest does not support S3. It will be disabled in ACPI tables. It
> is a TDX firmware spec requirement. Please check the following spec,
> sec 2.1
>
> https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf

I'm not asking about TDX guest entering S3...

>
> In TDX guest, we encounter cache flushes only in shutdown and reboot path.
> So there is no resume path.

Host is free to go into S3 independent of any guest state. A hostile
host is free to do just enough cache management so that it can resume
from S3 while arranging for TDX guest dirty data to be lost. Does a
TDX guest go fatal if the cache loses power?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  4:02                                                               ` [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest Andy Lutomirski
@ 2021-06-09  4:21                                                                 ` Dan Williams
  2021-06-09  4:25                                                                 ` Andi Kleen
  1 sibling, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-06-09  4:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Dave Hansen,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Tue, Jun 8, 2021 at 9:02 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On 6/8/21 8:40 PM, Dan Williams wrote:
> > On Tue, Jun 8, 2021 at 6:10 PM Kuppuswamy Sathyanarayanan
> > <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> >>
> >> Current TDX spec does not have support to emulate the WBINVD
> >> instruction. If any feature that uses WBINVD is enabled/used
> >> in TDX guest, it will lead to un-handled #VE exception, which
> >> will be handled as #GP fault.
> >>
> >> ACPI drivers also uses WBINVD instruction for cache flushes in
> >> reboot or shutdown code path. Since TDX guest has requirement
> >> to support shutdown feature, skip WBINVD instruction usage
> >> in ACPI drivers for TDX guest.
> >
> > This sounds awkward...
> >
> >> Since cache is always coherent in TDX guests, making wbinvd as
> >
> > This is incorrect, ACPI cache flushing is not about I/O or CPU coherency...
> >
> >> noop should not cause any issues in above mentioned code path.
> >
> > ..."should" is a famous last word...
> >
> >> The end-behavior is the same as KVM guest (treat as noops).
> >
> > ..."KVM gets away with it" is not a justification that TDX can stand
> > on otherwise we would not be here fixing up ACPICA properly.
> >
> > How about:
> >
> > "TDX guests use standard ACPI mechanisms to signal sleep state entry
> > (including reboot) to the host. The ACPI specification mandates WBINVD
> > on any sleep state entry with the expectation that the platform is
> > only responsible for maintaining the state of memory over sleep
> > states, not preserving dirty data in any CPU caches. ACPI cache
> > flushing requirements pre-date the advent of virtualization. Given TDX
> > guest sleep state entry does not affect any host power rails it is not
> > required to flush caches. The host is responsible for maintaining
> > cache state over its own bare metal sleep state transitions that
> > power-off the cache. If the host fails to manage caches over its sleep
> > state transitions the guest..."
> >
>
> I like this description, but shouldn't the logic be:
>
> if (!CPUID has hypervisor bit set)
>   wbinvd();
>
> As far as I know, most hypervisors will turn WBINVD into a noop and,
> even if they don't, it seems to be that something must be really quite
> wrong for a guest to need to WBINVD for ACPI purposes.

Agree, a well behaved guest should not pretend its callouts to the
virtual ACPI BIOS actually affect a host power rail.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  4:02                                                               ` [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest Andy Lutomirski
  2021-06-09  4:21                                                                 ` Dan Williams
@ 2021-06-09  4:25                                                                 ` Andi Kleen
  2021-06-09  4:32                                                                   ` Andy Lutomirski
  1 sibling, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-06-09  4:25 UTC (permalink / raw)
  To: Andy Lutomirski, Dan Williams, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List


> I like this description, but shouldn't the logic be:
>
> if (!CPUID has hypervisor bit set)
>    wbinvd();
>
> As far as I know, most hypervisors will turn WBINVD into a noop and,
> even if they don't, it seems to be that something must be really quite
> wrong for a guest to need to WBINVD for ACPI purposes.

KVM only turns it into a noop if there is no VT-d, because with VT-d you 
might need it to turn mappings into uncached and vice versa.

But yes the change would make sense for reboot. BTW I suspect for the 
reboot path it isn't really needed anywhere modern, so it might actually 
be ok to completely disable it. But that's some risk, so doing it only 
for hypervisor is reasonable.

I can see it making sense for the S3 path, but nobody supports S3 for 
guests.

-Andi


>
> -Andy

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  4:19                                                                 ` Dan Williams
@ 2021-06-09  4:27                                                                   ` Andi Kleen
  2021-06-09 15:09                                                                     ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-06-09  4:27 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List


> So there is no resume path.

> Host is free to go into S3 independent of any guest state.

Actually my understanding is that none of the systems which support TDX 
support S3. S3 has been deprecated for a long time.


>   A hostile
> host is free to do just enough cache management so that it can resume
> from S3 while arranging for TDX guest dirty data to be lost. Does a
> TDX guest go fatal if the cache loses power?

That would be a machine check, and yes it would be fatal.

-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  4:25                                                                 ` Andi Kleen
@ 2021-06-09  4:32                                                                   ` Andy Lutomirski
  2021-06-09  4:40                                                                     ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Andy Lutomirski @ 2021-06-09  4:32 UTC (permalink / raw)
  To: Andi Kleen, Williams, Dan J, Sathyanarayanan Kuppuswamy
  Cc: Peter Zijlstra (Intel),
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Tue, Jun 8, 2021, at 9:25 PM, Andi Kleen wrote:
> 
> > I like this description, but shouldn't the logic be:
> >
> > if (!CPUID has hypervisor bit set)
> >    wbinvd();
> >
> > As far as I know, most hypervisors will turn WBINVD into a noop and,
> > even if they don't, it seems to be that something must be really quite
> > wrong for a guest to need to WBINVD for ACPI purposes.
> 
> KVM only turns it into a noop if there is no VT-d, because with VT-d you 
> might need it to turn mappings into uncached and vice versa.

Wow, I found the kvm_arch_register_noncoherent_dma() stuff.  That's horrifying.  What's it for?  Are there actually guests that use devices exposed by VFIO that expect WBINVD to work?  That's a giant DoS hole.

> 
> But yes the change would make sense for reboot. BTW I suspect for the 
> reboot path it isn't really needed anywhere modern, so it might actually 
> be ok to completely disable it. But that's some risk, so doing it only 
> for hypervisor is reasonable.

I agree.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  4:32                                                                   ` Andy Lutomirski
@ 2021-06-09  4:40                                                                     ` Andi Kleen
  2021-06-09  4:54                                                                       ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-06-09  4:40 UTC (permalink / raw)
  To: Andy Lutomirski, Williams, Dan J, Sathyanarayanan Kuppuswamy
  Cc: Peter Zijlstra (Intel),
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List


>> KVM only turns it into a noop if there is no VT-d, because with VT-d you
>> might need it to turn mappings into uncached and vice versa.
> Wow, I found the kvm_arch_register_noncoherent_dma() stuff.  That's horrifying.  What's it for?

e.g. if you want to run a GPU it really needs some uncached memory. Same 
is true for other more complex devices.

Now modern Linux of course will be preferring CLFLUSH instead for the 
conversion, but there are old versions that preferred WBINVD.

I don't think it's a DoS, as long as you're not too picky about 
latencies on the host.

-Andi




^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  4:40                                                                     ` Andi Kleen
@ 2021-06-09  4:54                                                                       ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-09  4:54 UTC (permalink / raw)
  To: Andi Kleen, Andy Lutomirski, Williams, Dan J
  Cc: Peter Zijlstra (Intel),
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List



On 6/8/21 9:40 PM, Andi Kleen wrote:
> 
>>> KVM only turns it into a noop if there is no VT-d, because with VT-d you
>>> might need it to turn mappings into uncached and vice versa.
>> Wow, I found the kvm_arch_register_noncoherent_dma() stuff.  That's horrifying.  What's it for?
> 
> e.g. if you want to run a GPU it really needs some uncached memory. Same is true for other more 
> complex devices.
> 
> Now modern Linux of course will be preferring CLFLUSH instead for the conversion, but there are old 
> versions that preferred WBINVD.
> 
> I don't think it's a DoS, as long as you're not too picky about latencies on the host.
> 
> -Andi
> 

Currently we use a prot_guest_has(PR_GUEST_DISABLE_WBINVD) check for disabling
wbinvd() usage (which can be selectively enabled for tested guests).

Is it alright to generalize it with boot_cpu_has(X86_FEATURE_HYPERVISOR) without
verifying it?

> 
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-07 18:26                   ` Borislav Petkov
@ 2021-06-09 14:01                     ` Kuppuswamy, Sathyanarayanan
  2021-06-09 14:32                       ` Borislav Petkov
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-09 14:01 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Tom Lendacky



On 6/7/21 11:26 AM, Borislav Petkov wrote:
>> This header only exists in x86 arch code. So it is better to protect
>> it with x86 specific header file.
> That doesn't sound like a special reason to me. And compilers are
> usually very able at discarding unused symbols so I don't see a problem
> with keeping all includes at the top, like it is usually done.

I am still not clear. What happens when a driver which includes
linux/protected_guest.h is compiled for a non-x86 arch (s390 or arm64)?

Since asm/sev.h and asm/tdx.h exist only in the x86_64 arch, IMO, the
includes should be placed under CONFIG_INTEL_TDX_GUEST or CONFIG_AMD_MEM_ENCRYPT.

Did I miss anything?

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  1:10                                                           ` [RFC v2-fix-v4 " Kuppuswamy Sathyanarayanan
  2021-06-09  3:40                                                             ` Dan Williams
@ 2021-06-09 14:12                                                             ` Dave Hansen
  1 sibling, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-06-09 14:12 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 6/8/21 6:10 PM, Kuppuswamy Sathyanarayanan wrote:
> Since cache is always coherent in TDX guests, making wbinvd as
> noop should not cause any issues in above mentioned code path.
> The end-behavior is the same as KVM guest (treat as noops).

I don't see anything in the specs to back up such a broad statement.

For Secure-EPT, I see in the TDX "EAS" that "Ignore PAT" is "Set to 1".
 This, presumably along with the "TD VMCS Guest MSRs... IA32_PAT" being
set to 0x0007040600070406 (I didn't decode it, I'm just guessing),
ensures that guests using Secure-EPT have no architectural way of
creating non-coherent mappings using the guest x86 page tables.

That covers one of the memory types to which guests have access.

Guests can also access TD-shared memory.  Those mappings are controlled
by the VMM and not mapped by Secure-EPT.  This is the part that concerns
me and is not consistent with the statement above.  Is it
architecturally impossible for a VMM to create a non-coherent mapping
and expose it to a guest?  If it is impossible, please include citations
of the spec or the logic behind this so that a reader can understand,
just as I did above.

If it is possible to have non-coherent mappings in a guest, then please
remove the above statement.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-09 14:01                     ` Kuppuswamy, Sathyanarayanan
@ 2021-06-09 14:32                       ` Borislav Petkov
  2021-06-09 14:56                         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-06-09 14:32 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Tom Lendacky

On Wed, Jun 09, 2021 at 07:01:13AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> I am still not clear. What happens when a driver which includes
> linux/protected-guest.h is compiled for non-x86 arch (s390 or arm64)?

I was wondering what felt weird: why is prot{ected,}_guest_has() in a
generic linux/ namespace header and not in an asm/ one?

I think the proper way for the other arches is to provide their own
prot_guest_has() implementation which generic code uses, with the
generic header containing only the PR_GUEST_* defines.

Take ioremap() as an example:

arch/x86/include/asm/io.h
arch/arm64/include/asm/io.h
arch/s390/include/asm/io.h
...

and pretty much every arch has that arch-specific io.h header which
defines ioremap() and generic code includes include/linux/io.h which
includes the respective asm/io.h header so that users can call the
respective ioremap() implementation.

prot_guest_has() sounds just the same to me.

Better?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-09 14:32                       ` Borislav Petkov
@ 2021-06-09 14:56                         ` Kuppuswamy, Sathyanarayanan
  2021-06-09 15:01                           ` Borislav Petkov
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-09 14:56 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Tom Lendacky



On 6/9/21 7:32 AM, Borislav Petkov wrote:
> On Wed, Jun 09, 2021 at 07:01:13AM -0700, Kuppuswamy, Sathyanarayanan wrote:
>> I am still not clear. What happens when a driver which includes
>> linux/protected-guest.h is compiled for non-x86 arch (s390 or arm64)?
> 
> I was wondering what felt weird: why is prot{ected,}_guest_has() in a
> generic linux/ namespace header and not in an asm/ one?
> 
> I think the proper way is for the other arches should be to provide
> their own prot_guest_has() implementation which generic code uses and
> the generic header would contain only the PR_GUEST_* defines.
> 
> Take ioremap() as an example:
> 
> arch/x86/include/asm/io.h
> arch/arm64/include/asm/io.h
> arch/s390/include/asm/io.h
> ...
> 
> and pretty much every arch has that arch-specific io.h header which
> defines ioremap() and generic code includes include/linux/io.h which
> includes the respective asm/io.h header so that users can call the
> respective ioremap() implementation.
> 
> prot_guest_has() sounds just the same to me.

ioremap() is required for all architectures. So I think adding support for it
and creating io.h for every arch seems valid. But are you sure every arch cares
about protected guest support?

IMHO, it's better to leave it to arch maintainers to decide if they want
to support protected guests or not.

This can be easily achieved by defining a generic arch-independent config
option ARCH_HAS_PROTECTED_GUEST.

Any arch which wants to support prot_guest_has() can enable the above
config option and create its own asm/protected_guest.h.

This model is similar to linux/mem_encrypt.h.

With the above suggested change, the header file will look like below, and we
don't need to implement asm/protected_guest.h for every available arch.

--- a/include/linux/protected_guest.h
+++ b/include/linux/protected_guest.h

#ifndef _LINUX_PROTECTED_GUEST_H
#define _LINUX_PROTECTED_GUEST_H 1

/* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */

/* Support for guest encryption */
#define PR_GUEST_MEM_ENCRYPT			0x100
/* Encryption support is active */
#define PR_GUEST_MEM_ENCRYPT_ACTIVE		0x101
/* Support for unrolled string IO */
#define PR_GUEST_UNROLL_STRING_IO		0x102
/* Support for host memory encryption */
#define PR_GUEST_HOST_MEM_ENCRYPT		0x103
/* Support for shared mapping initialization (after early init) */
#define PR_GUEST_SHARED_MAPPING_INIT		0x104

#ifdef ARCH_HAS_PROTECTED_GUEST
#include <asm/protected_guest.h>
#else
static inline bool prot_guest_has(unsigned long flag) { return false; }
#endif

#endif /* _LINUX_PROTECTED_GUEST_H */


> 
> Better?
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction
  2021-06-09 14:56                         ` Kuppuswamy, Sathyanarayanan
@ 2021-06-09 15:01                           ` Borislav Petkov
  2021-06-09 19:41                             ` [RFC v2-fix-v4 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-06-09 15:01 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Tom Lendacky

On Wed, Jun 09, 2021 at 07:56:14AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> And any arch which wants to support prot_guest_has() can enable above
> config option and create their own asm/protected_guest.

I wouldn't have done even that but only the x86 asm version of
protected_guest.h and left it to other arches to extend it. I don't
like "preempting" use of functionality by other arches and would
leave them to extend stuff themselves, as they see fit, but ok,
ARCH_HAS_PROTECTED_GUEST sounds clean enough to me too, so sure, that's
fine too.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09  4:27                                                                   ` Andi Kleen
@ 2021-06-09 15:09                                                                     ` Dan Williams
  2021-06-09 16:12                                                                       ` Andy Lutomirski
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-06-09 15:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Tue, Jun 8, 2021 at 9:27 PM Andi Kleen <ak@linux.intel.com> wrote:
>
>
> So there is no resume path.
>
> > Host is free to go into S3 independent of any guest state.
>
> Actually my understanding is that none of the systems which support TDX
> support S3. S3 has been deprecated for a long time.

Ok, I wanted to imply any power state that might power-off caches.

>
>
> >   A hostile
> > host is free to do just enough cache management so that it can resume
> > from S3 while arranging for TDX guest dirty data to be lost. Does a
> > TDX guest go fatal if the cache loses power?
>
> That would be a machine check, and yes it would be fatal.

Sounds good, so incorporating this and Andy's feedback:

"TDX guests, like other typical guests, use standard ACPI mechanisms
to signal sleep state entry (including reboot) to the host. The ACPI
specification mandates WBINVD on any sleep state entry with the
expectation that the platform is only responsible for maintaining the
state of memory over sleep states, not preserving dirty data in any
CPU caches. ACPI cache flushing requirements pre-date the advent of
virtualization. Given guest sleep state entry does not affect any host
power rails it is not required to flush caches. The host is
responsible for maintaining cache state over its own bare metal sleep
state transitions that power-off the cache. A TDX guest, unlike a
typical guest, will machine check if the CPU cache is powered off."

Andi, is that machine check behavior relative to power states
mentioned in the docs?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09 15:09                                                                     ` Dan Williams
@ 2021-06-09 16:12                                                                       ` Andy Lutomirski
  2021-06-09 17:28                                                                         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Andy Lutomirski @ 2021-06-09 16:12 UTC (permalink / raw)
  To: Dan Williams, Andi Kleen
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Dave Hansen,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On 6/9/21 8:09 AM, Dan Williams wrote:
> On Tue, Jun 8, 2021 at 9:27 PM Andi Kleen <ak@linux.intel.com> wrote:
>>
>>
>> here is no resume path.
>>
>>> Host is free to go into S3 independent of any guest state.
>>
>> Actually my understanding is that none of the systems which support TDX
>> support S3. S3 has been deprecated for a long time.
> 
> Ok, I wanted to imply any power state that might power-off caches.
> 
>>
>>
>>>   A hostile
>>> host is free to do just enough cache management so that it can resume
>>> from S3 while arranging for TDX guest dirty data to be lost. Does a
>>> TDX guest go fatal if the cache loses power?
>>
>> That would be a machine check, and yes it would be fatal.
> 
> Sounds good, so incorporating this and Andy's feedback:
> 
> "TDX guests, like other typical guests, use standard ACPI mechanisms
> to signal sleep state entry (including reboot) to the host. The ACPI
> specification mandates WBINVD on any sleep state entry with the
> expectation that the platform is only responsible for maintaining the
> state of memory over sleep states, not preserving dirty data in any
> CPU caches. ACPI cache flushing requirements pre-date the advent of
> virtualization. Given guest sleep state entry does not affect any host
> power rails it is not required to flush caches. The host is
> responsible for maintaining cache state over its own bare metal sleep
> state transitions that power-off the cache. A TDX guest, unlike a
> typical guest, will machine check if the CPU cache is powered off."
> 
> Andi, is that machine check behavior relative to power states
> mentioned in the docs?

I don't think there's anything about power states.  There is a general
documented mechanism to integrity-check TD guest memory, but it is *not*
replay-resistant.  So, if the guest dirties a cache line, and the cache
line is lost, it seems entirely plausible that the guest would get
silently corrupted.

I would argue that, if this happens, it's a host, TD module, or
architecture bug, and it's not the guest's fault.

--Andy

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09 16:12                                                                       ` Andy Lutomirski
@ 2021-06-09 17:28                                                                         ` Kuppuswamy, Sathyanarayanan
  2021-06-09 17:31                                                                           ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-09 17:28 UTC (permalink / raw)
  To: Andy Lutomirski, Dan Williams, Andi Kleen
  Cc: Peter Zijlstra, Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List



On 6/9/21 9:12 AM, Andy Lutomirski wrote:
> On 6/9/21 8:09 AM, Dan Williams wrote:
>> On Tue, Jun 8, 2021 at 9:27 PM Andi Kleen <ak@linux.intel.com> wrote:
>>>
>>>
>>> here is no resume path.
>>>
>>>> Host is free to go into S3 independent of any guest state.
>>>
>>> Actually my understanding is that none of the systems which support TDX
>>> support S3. S3 has been deprecated for a long time.
>>
>> Ok, I wanted to imply any power state that might power-off caches.
>>
>>>
>>>
>>>>    A hostile
>>>> host is free to do just enough cache management so that it can resume
>>>> from S3 while arranging for TDX guest dirty data to be lost. Does a
>>>> TDX guest go fatal if the cache loses power?
>>>
>>> That would be a machine check, and yes it would be fatal.
>>
>> Sounds good, so incorporating this and Andy's feedback:
>>
>> "TDX guests, like other typical guests, use standard ACPI mechanisms
>> to signal sleep state entry (including reboot) to the host. The ACPI
>> specification mandates WBINVD on any sleep state entry with the
>> expectation that the platform is only responsible for maintaining the
>> state of memory over sleep states, not preserving dirty data in any
>> CPU caches. ACPI cache flushing requirements pre-date the advent of
>> virtualization. Given guest sleep state entry does not affect any host
>> power rails it is not required to flush caches. The host is
>> responsible for maintaining cache state over its own bare metal sleep
>> state transitions that power-off the cache. A TDX guest, unlike a
>> typical guest, will machine check if the CPU cache is powered off."
>>
>> Andi, is that machine check behavior relative to power states
>> mentioned in the docs?
> 
> I don't think there's anything about power states.  There is a general
> documented mechanism to integrity-check TD guest memory, but it is *not*
> replay-resistant.  So, if the guest dirties a cache line, and the cache
> line is lost, it seems entirely plausible that the guest would get
> silently corrupted.
> 
> I would argue that, if this happens, it's a host, TD module, or
> architecture bug, and it's not the guest's fault.

If you want to apply this fix for all hypervisors (using boot_cpu_has
(X86_FEATURE_HYPERVISOR) check), then we don't need any TDX specific
reference in commit log right? It can be generalized for all VM guests.

agree?

> 
> --Andy
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09 17:28                                                                         ` Kuppuswamy, Sathyanarayanan
@ 2021-06-09 17:31                                                                           ` Dan Williams
  2021-06-09 18:24                                                                             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-06-09 17:31 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Andy Lutomirski, Andi Kleen, Peter Zijlstra, Dave Hansen,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Wed, Jun 9, 2021 at 10:28 AM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
>
>
> On 6/9/21 9:12 AM, Andy Lutomirski wrote:
> > On 6/9/21 8:09 AM, Dan Williams wrote:
> >> On Tue, Jun 8, 2021 at 9:27 PM Andi Kleen <ak@linux.intel.com> wrote:
> >>>
> >>>
> >>> here is no resume path.
> >>>
> >>>> Host is free to go into S3 independent of any guest state.
> >>>
> >>> Actually my understanding is that none of the systems which support TDX
> >>> support S3. S3 has been deprecated for a long time.
> >>
> >> Ok, I wanted to imply any power state that might power-off caches.
> >>
> >>>
> >>>
> >>>>    A hostile
> >>>> host is free to do just enough cache management so that it can resume
> >>>> from S3 while arranging for TDX guest dirty data to be lost. Does a
> >>>> TDX guest go fatal if the cache loses power?
> >>>
> >>> That would be a machine check, and yes it would be fatal.
> >>
> >> Sounds good, so incorporating this and Andy's feedback:
> >>
> >> "TDX guests, like other typical guests, use standard ACPI mechanisms
> >> to signal sleep state entry (including reboot) to the host. The ACPI
> >> specification mandates WBINVD on any sleep state entry with the
> >> expectation that the platform is only responsible for maintaining the
> >> state of memory over sleep states, not preserving dirty data in any
> >> CPU caches. ACPI cache flushing requirements pre-date the advent of
> >> virtualization. Given guest sleep state entry does not affect any host
> >> power rails it is not required to flush caches. The host is
> >> responsible for maintaining cache state over its own bare metal sleep
> >> state transitions that power-off the cache. A TDX guest, unlike a
> >> typical guest, will machine check if the CPU cache is powered off."
> >>
> >> Andi, is that machine check behavior relative to power states
> >> mentioned in the docs?
> >
> > I don't think there's anything about power states.  There is a general
> > documented mechanism to integrity-check TD guest memory, but it is *not*
> > replay-resistant.  So, if the guest dirties a cache line, and the cache
> > line is lost, it seems entirely plausible that the guest would get
> > silently corrupted.
> >
> > I would argue that, if this happens, it's a host, TD module, or
> > architecture bug, and it's not the guest's fault.
>
> If you want to apply this fix for all hypervisors (using boot_cpu_has
> (X86_FEATURE_HYPERVISOR) check), then we don't need any TDX specific
> reference in commit log right? It can be generalized for all VM guests.
>
> agree?

No, because there is a note needed about the integrity implications in
the TDX case that makes it distinct from typical hypervisor enabling.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest
  2021-06-09 17:31                                                                           ` Dan Williams
@ 2021-06-09 18:24                                                                             ` Kuppuswamy, Sathyanarayanan
  2021-06-09 19:49                                                                               ` [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-09 18:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Lutomirski, Andi Kleen, Peter Zijlstra, Dave Hansen,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List



On 6/9/21 10:31 AM, Dan Williams wrote:
>> If you want to apply this fix for all hypervisors (using boot_cpu_has
>> (X86_FEATURE_HYPERVISOR) check), then we don't need any TDX specific
>> reference in commit log right? It can be generalized for all VM guests.
>>
>> agree?
> No, because there is a note needed about the integrity implications in
> the TDX case that makes it distinct from typical hypervisor enabling.

Generalized the commit log (but left the TDX related info). Final version
will look like below.

x86: Skip WBINVD instruction for VM guest

VM guests that support ACPI use standard ACPI mechanisms to signal sleep
state entry (including reboot) to the host. The ACPI specification mandates
WBINVD on any sleep state entry with the expectation that the platform is
only responsible for maintaining the state of memory over sleep states, not
preserving dirty data in any CPU caches. ACPI cache flushing requirements
pre-date the advent of virtualization. Given guest sleep state entry does not
affect any host power rails it is not required to flush caches. The host is
responsible for maintaining cache state over its own bare metal sleep state
transitions that power-off the cache. A TDX guest, unlike a typical guest,
will machine check if the CPU cache is powered off.

--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -10,10 +10,15 @@
  #define _ASM_X86_ACENV_H

  #include <asm/special_insns.h>
+#include <asm/cpu.h>

  /* Asm macros */

-#define ACPI_FLUSH_CPU_CACHE() wbinvd()
+#define ACPI_FLUSH_CPU_CACHE()                         \
+do {                                                   \
+       if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))      \
+               wbinvd();                               \
+} while (0)

  int __acpi_acquire_global_lock(unsigned int *lock);
  int __acpi_release_global_lock(unsigned int *lock);

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v4 1/1] x86: Introduce generic protected guest abstraction
  2021-06-09 15:01                           ` Borislav Petkov
@ 2021-06-09 19:41                             ` Kuppuswamy Sathyanarayanan
  2021-06-09 22:53                               ` Sathyanarayanan Kuppuswamy Natarajan
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-09 19:41 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Borislav Petkov
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kuppuswamy Sathyanarayanan

Add a generic way to check whether we are running as an
encrypted guest, without requiring x86-specific ifdefs. This
can then be used in non-architecture-specific code.

prot_guest_has() is used to check for protected guest feature
flags.

Originally-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
Changes since RFC v2-fix-v3:
 * Introduced ARCH_HAS_PROTECTED_GUEST and moved arch specific checks to
   asm/protected_guest.h

Changes since RFC v2-fix-v2:
 * Renamed protected_guest_has() to prot_guest_has().
 * Changed flag prefix from VM_ to PR_GUEST_
 * Merged Borislav AMD implementation fix.

 arch/Kconfig                           |  3 +++
 arch/x86/Kconfig                       |  2 ++
 arch/x86/include/asm/protected_guest.h | 20 ++++++++++++++++++++
 arch/x86/include/asm/sev.h             |  3 +++
 arch/x86/include/asm/tdx.h             |  7 +++++++
 arch/x86/kernel/sev.c                  | 15 +++++++++++++++
 arch/x86/kernel/tdx.c                  | 15 +++++++++++++++
 arch/x86/mm/mem_encrypt.c              |  1 +
 include/linux/protected_guest.h        | 24 ++++++++++++++++++++++++
 9 files changed, 90 insertions(+)
 create mode 100644 arch/x86/include/asm/protected_guest.h
 create mode 100644 include/linux/protected_guest.h

diff --git a/arch/Kconfig b/arch/Kconfig
index c45b770d3579..3c5bf55ee752 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1011,6 +1011,9 @@ config HAVE_ARCH_NVRAM_OPS
 config ISA_BUS_API
 	def_bool ISA
 
+config ARCH_HAS_PROTECTED_GUEST
+	bool
+
 #
 # ABI hall of shame
 #
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a99adc683db9..fc51579e54ad 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -883,6 +883,7 @@ config INTEL_TDX_GUEST
 	select PARAVIRT_XL
 	select X86_X2APIC
 	select SECURITY_LOCKDOWN_LSM
+	select ARCH_HAS_PROTECTED_GUEST
 	help
 	  Provide support for running in a trusted domain on Intel processors
	  equipped with Trusted Domain eXtensions. TDX is a new Intel
@@ -1544,6 +1545,7 @@ config AMD_MEM_ENCRYPT
 	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
 	select INSTRUCTION_DECODER
 	select ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS
+	select ARCH_HAS_PROTECTED_GUEST
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/include/asm/protected_guest.h b/arch/x86/include/asm/protected_guest.h
new file mode 100644
index 000000000000..137976ef894a
--- /dev/null
+++ b/arch/x86/include/asm/protected_guest.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_PROTECTED_GUEST_H
+#define _ASM_PROTECTED_GUEST_H 1
+
+#include <asm/processor.h>
+#include <asm/tdx.h>
+#include <asm/sev.h>
+
+static inline bool prot_guest_has(unsigned long flag)
+{
+	if (is_tdx_guest())
+		return tdx_protected_guest_has(flag);
+	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+		return sev_protected_guest_has(flag);
+
+	return false;
+}
+
+#endif /* _ASM_PROTECTED_GUEST_H */
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index fa5cd05d3b5b..e9b0b93a3157 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -81,12 +81,15 @@ static __always_inline void sev_es_nmi_complete(void)
 		__sev_es_nmi_complete();
 }
 extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
+bool sev_protected_guest_has(unsigned long flag);
+
 #else
 static inline void sev_es_ist_enter(struct pt_regs *regs) { }
 static inline void sev_es_ist_exit(void) { }
 static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
 static inline void sev_es_nmi_complete(void) { }
 static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
+static inline bool sev_protected_guest_has(unsigned long flag) { return false; }
 #endif
 
 #endif
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f0c1912837c8..cbfe7479f2a3 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -71,6 +71,8 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
 		    struct tdx_hypercall_output *out);
 
+bool tdx_protected_guest_has(unsigned long flag);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
@@ -80,6 +82,11 @@ static inline bool is_tdx_guest(void)
 
 static inline void tdx_early_init(void) { };
 
+static inline bool tdx_protected_guest_has(unsigned long flag)
+{
+	return false;
+}
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #ifdef CONFIG_INTEL_TDX_GUEST_KVM
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 651b81cd648e..16e5c5f25e6f 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -19,6 +19,7 @@
 #include <linux/memblock.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/protected_guest.h>
 
 #include <asm/cpu_entry_area.h>
 #include <asm/stacktrace.h>
@@ -1493,3 +1494,17 @@ bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
 	while (true)
 		halt();
 }
+
+bool sev_protected_guest_has(unsigned long flag)
+{
+	switch (flag) {
+	case PR_GUEST_MEM_ENCRYPT:
+	case PR_GUEST_MEM_ENCRYPT_ACTIVE:
+	case PR_GUEST_UNROLL_STRING_IO:
+	case PR_GUEST_HOST_MEM_ENCRYPT:
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(sev_protected_guest_has);
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 17725646eb30..111f15c05e24 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,7 @@
 #include <asm/vmx.h>
 
 #include <linux/cpu.h>
+#include <linux/protected_guest.h>
 
 /* TDX Module call Leaf IDs */
 #define TDINFO				1
@@ -75,6 +76,20 @@ bool is_tdx_guest(void)
 }
 EXPORT_SYMBOL_GPL(is_tdx_guest);
 
+bool tdx_protected_guest_has(unsigned long flag)
+{
+	switch (flag) {
+	case PR_GUEST_MEM_ENCRYPT:
+	case PR_GUEST_MEM_ENCRYPT_ACTIVE:
+	case PR_GUEST_UNROLL_STRING_IO:
+	case PR_GUEST_SHARED_MAPPING_INIT:
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(tdx_protected_guest_has);
+
 static void tdg_get_info(void)
 {
 	u64 ret;
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index ff08dc463634..d0026bce47df 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -20,6 +20,7 @@
 #include <linux/bitops.h>
 #include <linux/dma-mapping.h>
 #include <linux/virtio_config.h>
+#include <linux/protected_guest.h>
 
 #include <asm/tlbflush.h>
 #include <asm/fixmap.h>
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
new file mode 100644
index 000000000000..0facb8547217
--- /dev/null
+++ b/include/linux/protected_guest.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_PROTECTED_GUEST_H
+#define _LINUX_PROTECTED_GUEST_H 1
+
+/* Protected Guest Feature Flags (leave 0-0xff for arch specific flags) */
+
+/* Support for guest encryption */
+#define PR_GUEST_MEM_ENCRYPT			0x100
+/* Encryption support is active */
+#define PR_GUEST_MEM_ENCRYPT_ACTIVE		0x101
+/* Support for unrolled string IO */
+#define PR_GUEST_UNROLL_STRING_IO		0x102
+/* Support for host memory encryption */
+#define PR_GUEST_HOST_MEM_ENCRYPT		0x103
+/* Support for shared mapping initialization (after early init) */
+#define PR_GUEST_SHARED_MAPPING_INIT		0x104
+
+#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
+#include <asm/protected_guest.h>
+#else
+static inline bool prot_guest_has(unsigned long flag) { return false; }
+#endif
+
+#endif /* _LINUX_PROTECTED_GUEST_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest
  2021-06-09 18:24                                                                             ` Kuppuswamy, Sathyanarayanan
@ 2021-06-09 19:49                                                                               ` Kuppuswamy Sathyanarayanan
  2021-06-09 19:56                                                                                 ` Dan Williams
  2021-06-09 21:03                                                                                 ` Dave Hansen
  0 siblings, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-06-09 19:49 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

VM guests that support ACPI use standard ACPI mechanisms to
signal sleep state entry (including reboot) to the host. The
ACPI specification mandates WBINVD on any sleep state entry
with the expectation that the platform is only responsible for
maintaining the state of memory over sleep states, not
preserving dirty data in any CPU caches. ACPI cache flushing
requirements pre-date the advent of virtualization. Given guest
sleep state entry does not affect any host power rails it is not
required to flush caches. The host is responsible for maintaining
cache state over its own bare metal sleep state transitions that
power-off the cache. A TDX guest, unlike a typical guest, will
machine check if the CPU cache is powered off.
   
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix-v4:
 * Fixed commit log as per Dan's comments.
 * Used boot_cpu_has(X86_FEATURE_HYPERVISOR) instead of
   prot_guest_has(PR_GUEST_DISABLE_WBINVD) check.
   
Changes since RFC v2-fix-v3:
 * Fixed commit log as per review comments.
 * Instead of fixing all usages of ACPI_FLUSH_CPU_CACHE(),
   created TDX specific exception for it in its implementation.

Changes since RFC v2-fix-v2:
 * Instead of handling WBINVD #VE exception as nop, we skip its
   usage in currently enabled drivers.
 * Adapted commit log for above change.

 arch/x86/include/asm/acenv.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/acenv.h b/arch/x86/include/asm/acenv.h
index 9aff97f0de7f..d4162e94bee8 100644
--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -10,10 +10,15 @@
 #define _ASM_X86_ACENV_H
 
 #include <asm/special_insns.h>
+#include <asm/cpu.h>
 
 /* Asm macros */
 
-#define ACPI_FLUSH_CPU_CACHE()	wbinvd()
+#define ACPI_FLUSH_CPU_CACHE()				\
+do {							\
+	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))	\
+		wbinvd();				\
+} while (0)
 
 int __acpi_acquire_global_lock(unsigned int *lock);
 int __acpi_release_global_lock(unsigned int *lock);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest
  2021-06-09 19:49                                                                               ` [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest Kuppuswamy Sathyanarayanan
@ 2021-06-09 19:56                                                                                 ` Dan Williams
  2021-06-09 21:03                                                                                 ` Dave Hansen
  1 sibling, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-06-09 19:56 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Linux ACPI, Rafael J. Wysocki

[ add back linux-acpi and Rafael ]


On Wed, Jun 9, 2021 at 12:49 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> VM guests that support ACPI use standard ACPI mechanisms to
> signal sleep state entry (including reboot) to the host. The
> ACPI specification mandates WBINVD on any sleep state entry
> with the expectation that the platform is only responsible for
> maintaining the state of memory over sleep states, not
> preserving dirty data in any CPU caches. ACPI cache flushing
> requirements pre-date the advent of virtualization. Given guest
> sleep state entry does not affect any host power rails it is not
> required to flush caches. The host is responsible for maintaining
> cache state over its own bare metal sleep state transitions that
> power-off the cache. A TDX guest, unlike a typical guest, will
> machine check if the CPU cache is powered off.

Looks like you are wrapping at column 62 rather than 72; double-check that
for the final submission of this series. Other than that this looks
good to me.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest
  2021-06-09 19:49                                                                               ` [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest Kuppuswamy Sathyanarayanan
  2021-06-09 19:56                                                                                 ` Dan Williams
@ 2021-06-09 21:03                                                                                 ` Dave Hansen
  2021-06-09 21:38                                                                                   ` Dan Williams
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-06-09 21:03 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Dan Williams
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

This changelog lacks both clear problem statements and a clear solution
implemented within the patch.

Here's a proposed changelog.  It clearly spells out the two problems
caused by WBINVD within a guest, and the proposed solution which fixes
those two problems.

Is this missing anything?

--

VM guests that support ACPI use standard ACPI mechanisms to signal sleep
state entry to the host.  To ACPI, reboot is simply another sleep state.

ACPI specifies that the platform preserve memory contents over (some)
sleep states.  It does not specify any requirements for data
preservation in CPU caches.  The ACPI specification mandates the use of
WBINVD to flush the contents of the CPU caches to memory before entering
specific sleep states, thus ensuring data in caches can survive sleep
state transitions.

Unlike when entering sleep states on bare metal, no actions within a guest
can cause data in processor caches to be lost.  That makes these WBINVD
invocations harmless but superfluous within a guest. (<--- problem #1)

In TDX guests, these WBINVD operations cause #VE exceptions.  For debug,
it would be ideal for the #VE handler to be able to WARN() when an
unexpected WBINVD occurs. (<--- problem #2)

Avoid WBINVD for all ACPI cache-flushing operations which occur while
running under a hypervisor, which includes TDX guests.  This both avoids
TDX warnings and optimizes away superfluous WBINVD invocations. (<----
solution)


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest
  2021-06-09 21:03                                                                                 ` Dave Hansen
@ 2021-06-09 21:38                                                                                   ` Dan Williams
  2021-06-09 21:42                                                                                     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-06-09 21:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Wed, Jun 9, 2021 at 2:03 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> This changelog lacks both clear problem statements and a clear solution
> implemented within the patch.
>
> Here's a proposed changelog.  It clearly spells out the two problems
> caused by WBINVD within a guest, and the proposed solution which fixes
> those two problems.

Looks good to me modulo the comment below...

>
> Is this missing anything?
>
> --
>
> VM guests that support ACPI use standard ACPI mechanisms to signal sleep
> state entry to the host.  To ACPI, reboot is simply another sleep state.
>
> ACPI specifies that the platform preserve memory contents over (some)
> sleep states.  It does not specify any requirements for data
> preservation in CPU caches.  The ACPI specification mandates the use of
> WBINVD to flush the contents of the CPU caches to memory before entering
> specific sleep states, thus ensuring data in caches can survive sleep
> state transitions.
>
> Unlike when entering sleep states on bare metal, no actions within a guest
> can cause data in processor caches to be lost.  That makes these WBINVD
> invocations harmless but superfluous within a guest. (<--- problem #1)
>
> In TDX guests, these WBINVD operations cause #VE exceptions.  For debug,
> it would be ideal for the #VE handler to be able to WARN() when an
> unexpected WBINVD occurs. (<--- problem #2)

...but it doesn't WARN() it triggers unhandled #VE, unless I missed
another patch that precedes this that turns it into a WARN()? If a
code path expects WBINVD for correct operation and the guest can't
execute that sounds fatal, not a WARN to me.

> Avoid WBINVD for all ACPI cache-flushing operations which occur while
> running under a hypervisor, which includes TDX guests.  This both avoids
> TDX warnings and optimizes away superfluous WBINVD invocations. (<----
> solution)
>

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest
  2021-06-09 21:38                                                                                   ` Dan Williams
@ 2021-06-09 21:42                                                                                     ` Kuppuswamy, Sathyanarayanan
  2021-06-09 23:55                                                                                       ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-06-09 21:42 UTC (permalink / raw)
  To: Dan Williams, Dave Hansen
  Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List



On 6/9/21 2:38 PM, Dan Williams wrote:
>> In TDX guests, these WBINVD operations cause #VE exceptions.  For debug,
>> it would be ideal for the #VE handler to be able to WARN() when an
>> unexpected WBINVD occurs. (<--- problem #2)
> ...but it doesn't WARN() it triggers unhandled #VE, unless I missed
> another patch that precedes this that turns it into a WARN()? If a
> code path expects WBINVD for correct operation and the guest can't
> execute that sounds fatal, not a WARN to me.

Yes. It is not WARN. It is a fatal unhandled exception.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v4 1/1] x86: Introduce generic protected guest abstraction
  2021-06-09 19:41                             ` [RFC v2-fix-v4 " Kuppuswamy Sathyanarayanan
@ 2021-06-09 22:53                               ` Sathyanarayanan Kuppuswamy Natarajan
  0 siblings, 0 replies; 381+ messages in thread
From: Sathyanarayanan Kuppuswamy Natarajan @ 2021-06-09 22:53 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Borislav Petkov,
	Tony Luck, Andi Kleen, Kirill Shutemov, Dan Williams, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

Hi All,

On Wed, Jun 9, 2021 at 12:42 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> Add a generic way to check if we run with an encrypted guest,
> without requiring x86-specific ifdefs. This can then be used in
> non-architecture-specific code.
>
> prot_guest_has() is used to check for protected guest feature
> flags.
>
> Originally-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---

I have sent a non-RFC version of this patch for x86 review. Please use
it for further discussion.

https://lore.kernel.org/patchwork/patch/1444184/

> Changes since RFC v2-fix-v3:
>  * Introduced ARCH_HAS_PROTECTED_GUEST and moved arch specific checks to
>    asm/protected_guest.h
>
> Changes since RFC v2-fix-v2:
>  * Renamed protected_guest_has() to prot_guest_has().
>  * Changed flag prefix from VM_ to PR_GUEST_
>  * Merged Borislav AMD implementation fix.




-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest
  2021-06-09 21:42                                                                                     ` Kuppuswamy, Sathyanarayanan
@ 2021-06-09 23:55                                                                                       ` Dave Hansen
  0 siblings, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-06-09 23:55 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

On 6/9/21 2:42 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 6/9/21 2:38 PM, Dan Williams wrote:
>>> In TDX guests, these WBINVD operations cause #VE exceptions.  For debug,
>>> it would be ideal for the #VE handler to be able to WARN() when an
>>> unexpected WBINVD occurs. (<--- problem #2)
>> ...but it doesn't WARN(); it triggers an unhandled #VE, unless I missed
>> another patch that precedes this that turns it into a WARN()? If a
>> code path expects WBINVD for correct operation and the guest can't
>> execute it, that sounds fatal, not a WARN, to me.
> 
> Yes. It is not WARN. It is a fatal unhandled exception.

That makes the problem statement a wee bit different, but it should
still be pretty easy to explain:

	In TDX guests, these WBINVD operations cause #VE exceptions.
	While some #VE exceptions can be handled, there is no recourse
	for a TDX guest to handle a WBINVD and it will panic(). (<---
	problem #2)

^ permalink raw reply	[flat|nested] 381+ messages in thread

end of thread, other threads:[~2021-06-09 23:55 UTC | newest]

Thread overview: 381+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL Kuppuswamy Sathyanarayanan
2021-04-27 17:31   ` Borislav Petkov
2021-05-06 14:59     ` Kirill A. Shutemov
2021-05-10  8:07     ` Juergen Gross
2021-05-10 15:52       ` Andi Kleen
2021-05-10 15:56         ` Juergen Gross
2021-05-12 12:07           ` Kirill A. Shutemov
2021-05-12 13:18           ` Peter Zijlstra
2021-05-12 13:24             ` Andi Kleen
2021-05-12 13:51               ` Juergen Gross
2021-05-17 23:50                 ` [RFC v2-fix 1/1] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option Kuppuswamy Sathyanarayanan
2021-04-26 21:09   ` Randy Dunlap
2021-04-26 22:32     ` Kuppuswamy, Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 03/32] x86/cpufeatures: Add TDX Guest CPU feature Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 04/32] x86/x86: Add is_tdx_guest() interface Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions Kuppuswamy Sathyanarayanan
2021-04-26 20:32   ` Dave Hansen
2021-04-26 22:31     ` Kuppuswamy, Sathyanarayanan
2021-04-26 23:17       ` Dave Hansen
2021-04-27  2:29         ` Kuppuswamy, Sathyanarayanan
2021-04-27 14:29           ` Dave Hansen
2021-04-27 19:18             ` Kuppuswamy, Sathyanarayanan
2021-04-27 19:20               ` Dave Hansen
2021-04-28 17:42                 ` [PATCH v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() " Kuppuswamy Sathyanarayanan
2021-05-19  5:58                 ` [RFC v2-fix-v1 " Kuppuswamy Sathyanarayanan
2021-05-19  6:04                   ` Kuppuswamy, Sathyanarayanan
2021-05-19 15:31                   ` Dave Hansen
2021-05-19 19:09                     ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
2021-05-19 19:13                     ` [RFC v2-fix-v1 " Kuppuswamy, Sathyanarayanan
2021-05-19 20:09                       ` Sean Christopherson
2021-05-19 20:49                         ` Andi Kleen
2021-05-27  0:30                           ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
2021-05-27 15:25                             ` Luck, Tony
2021-05-27 15:52                               ` Kuppuswamy, Sathyanarayanan
2021-05-27 16:25                                 ` Luck, Tony
2021-04-26 18:01 ` [RFC v2 06/32] x86/tdx: Get TD execution environment information via TDINFO Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 07/32] x86/traps: Add do_general_protection() helper function Kuppuswamy Sathyanarayanan
2021-05-07 21:20   ` Dave Hansen
2021-04-26 18:01 ` [RFC v2 08/32] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
2021-05-07 21:36   ` Dave Hansen
2021-05-13 19:47     ` Andi Kleen
2021-05-13 20:07       ` Dave Hansen
2021-05-13 22:43         ` Andi Kleen
2021-05-13 20:14       ` Dave Hansen
2021-05-18  0:09     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
2021-05-18 15:11       ` Dave Hansen
2021-05-18 15:45         ` Andi Kleen
2021-05-18 15:56           ` Dave Hansen
2021-05-18 16:00             ` Andi Kleen
2021-05-21 19:22           ` Dan Williams
2021-05-24 14:02             ` Andi Kleen
2021-05-27  0:29               ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
2021-05-27 15:11                 ` Luck, Tony
2021-05-27 16:24                   ` Sean Christopherson
2021-05-27 16:36                     ` Dave Hansen
2021-05-21 18:45       ` [RFC v2-fix " Kuppuswamy, Sathyanarayanan
2021-05-21 19:15         ` Dave Hansen
2021-05-21 19:57           ` Kuppuswamy, Sathyanarayanan
2021-06-08 17:02   ` [RFC v2 08/32] " Dave Hansen
2021-06-08 17:48     ` Sean Christopherson
2021-06-08 17:53       ` Dave Hansen
2021-06-08 18:12         ` Andi Kleen
2021-06-08 18:15           ` Dave Hansen
2021-06-08 18:17             ` Andy Lutomirski
2021-06-08 18:18             ` Andi Kleen
2021-04-26 18:01 ` [RFC v2 09/32] x86/tdx: Add HLT " Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls Kuppuswamy Sathyanarayanan
2021-05-07 21:46   ` Dave Hansen
2021-05-08  0:59     ` Kuppuswamy, Sathyanarayanan
2021-05-12 13:00       ` Kirill A. Shutemov
2021-05-12 14:10         ` Kuppuswamy, Sathyanarayanan
2021-05-12 14:29           ` Dave Hansen
2021-05-13 19:29             ` Kuppuswamy, Sathyanarayanan
2021-05-13 19:33               ` Dave Hansen
2021-05-18  0:15                 ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
2021-05-18 15:51                   ` Dave Hansen
2021-05-18 16:23                     ` Sean Christopherson
2021-05-18 20:12                     ` Kuppuswamy, Sathyanarayanan
2021-05-18 20:19                       ` Dave Hansen
2021-05-18 20:57                         ` Kuppuswamy, Sathyanarayanan
2021-05-18 21:19                         ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
2021-05-18 23:29                           ` Dave Hansen
2021-05-19  1:17                             ` [RFC v2-fix-v3 " Kuppuswamy Sathyanarayanan
2021-05-19  1:20                               ` Sathyanarayanan Kuppuswamy Natarajan
2021-04-26 18:01 ` [RFC v2 11/32] x86/tdx: Add MSR support for TDX guest Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 12/32] x86/tdx: Handle CPUID via #VE Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 13/32] x86/io: Allow to override inX() and outX() implementation Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 14/32] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
2021-05-10 21:57   ` Dan Williams
2021-05-10 23:08     ` Andi Kleen
2021-05-10 23:34       ` Dan Williams
2021-05-11  0:01         ` Andi Kleen
2021-05-11  0:21           ` Dan Williams
2021-05-11  0:30         ` Kuppuswamy, Sathyanarayanan
2021-05-11  1:07           ` Dan Williams
2021-05-11  2:29             ` Kuppuswamy, Sathyanarayanan
2021-05-11 14:39               ` Dave Hansen
2021-05-11 15:08                 ` Kuppuswamy, Sathyanarayanan
2021-05-11  0:56         ` Kuppuswamy, Sathyanarayanan
2021-05-11  2:19           ` Andi Kleen
2021-05-11 15:35     ` Dave Hansen
2021-05-11 15:43       ` Dan Williams
2021-05-12  6:17       ` Dan Williams
2021-05-27  4:23         ` [RFC v2-fix-v1 0/3] " Kuppuswamy Sathyanarayanan
2021-05-27  4:23           ` [RFC v2-fix-v1 1/3] tdx: Introduce generic protected_guest abstraction Kuppuswamy Sathyanarayanan
2021-06-01 21:14             ` [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstraction Kuppuswamy Sathyanarayanan
2021-06-02 17:20               ` Sean Christopherson
2021-06-02 18:15                 ` Tom Lendacky
2021-06-02 18:25                   ` Kuppuswamy, Sathyanarayanan
2021-06-02 18:29                   ` Borislav Petkov
2021-06-02 18:32                     ` Kuppuswamy, Sathyanarayanan
2021-06-02 18:39                       ` Borislav Petkov
2021-06-02 18:45                         ` Kuppuswamy, Sathyanarayanan
2021-06-02 18:19               ` Tom Lendacky
2021-06-02 18:29                 ` Kuppuswamy, Sathyanarayanan
2021-06-02 18:30                 ` Borislav Petkov
2021-06-03 18:14               ` Borislav Petkov
2021-06-03 18:15                 ` [RFC v2-fix-v2 1/1] x86: Introduce generic protected guest abstractionn Borislav Petkov
2021-06-04 22:01                   ` Tom Lendacky
2021-06-04 22:13                     ` Kuppuswamy, Sathyanarayanan
2021-06-04 22:15                     ` Borislav Petkov
2021-06-04 23:31                       ` Tom Lendacky
2021-06-05 11:03                         ` Borislav Petkov
2021-06-05 18:12                           ` Kuppuswamy, Sathyanarayanan
2021-06-05 20:08                             ` Borislav Petkov
2021-06-07 19:55                   ` Kirill A. Shutemov
2021-06-07 20:14                     ` Borislav Petkov
2021-06-07 22:26                       ` Kuppuswamy, Sathyanarayanan
2021-06-08 21:30                         ` [RFC v2-fix-v3 1/1] x86: Introduce generic protected guest abstraction Kuppuswamy Sathyanarayanan
2021-06-03 18:33                 ` [RFC v2-fix-v2 " Kuppuswamy, Sathyanarayanan
2021-06-03 18:41                   ` Borislav Petkov
2021-06-03 18:54                     ` Kuppuswamy, Sathyanarayanan
2021-06-07 18:01                 ` Kuppuswamy, Sathyanarayanan
2021-06-07 18:26                   ` Borislav Petkov
2021-06-09 14:01                     ` Kuppuswamy, Sathyanarayanan
2021-06-09 14:32                       ` Borislav Petkov
2021-06-09 14:56                         ` Kuppuswamy, Sathyanarayanan
2021-06-09 15:01                           ` Borislav Petkov
2021-06-09 19:41                             ` [RFC v2-fix-v4 " Kuppuswamy Sathyanarayanan
2021-06-09 22:53                               ` Sathyanarayanan Kuppuswamy Natarajan
2021-05-27  4:23           ` [RFC v2-fix-v1 2/3] x86/tdx: Handle early IO operations Kuppuswamy Sathyanarayanan
2021-06-05  4:26             ` Williams, Dan J
2021-05-27  4:23           ` [RFC v2-fix-v1 3/3] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
2021-06-05 18:52             ` Dan Williams
2021-06-05 20:08               ` Kuppuswamy, Sathyanarayanan
2021-06-05 21:08                 ` Dan Williams
2021-06-07 16:24                   ` Kuppuswamy, Sathyanarayanan
2021-06-07 17:17                     ` Dan Williams
2021-06-07 21:52                       ` Kuppuswamy, Sathyanarayanan
2021-06-07 22:00                         ` Dan Williams
2021-06-08  2:57                           ` Andi Kleen
2021-06-08 15:40                       ` [RFC v2-fix-v2 0/3] " Kuppuswamy Sathyanarayanan
2021-06-08 15:40                         ` [RFC v2-fix-v2 1/3] x86/tdx: Handle port I/O in decompression code Kuppuswamy Sathyanarayanan
2021-06-08 23:12                           ` Dan Williams
2021-06-08 15:40                         ` [RFC v2-fix-v2 2/3] x86/tdx: Handle early IO operations Kuppuswamy Sathyanarayanan
2021-06-08 15:40                         ` [RFC v2-fix-v2 3/3] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
2021-06-08 16:26                           ` Dan Williams
2021-04-26 18:01 ` [RFC v2 15/32] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
2021-05-07 21:52   ` Dave Hansen
2021-05-18  0:48     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
2021-05-18 15:00       ` Dave Hansen
2021-05-18 15:56         ` Andi Kleen
2021-05-18 16:04           ` Dave Hansen
2021-05-18 16:10             ` Andi Kleen
2021-05-18 16:22               ` Dave Hansen
2021-05-18 17:05                 ` Andi Kleen
2021-05-18 17:28               ` Andi Kleen
2021-05-18 17:11           ` Sean Christopherson
2021-05-18 17:21             ` Andi Kleen
2021-05-18 17:46               ` Dave Hansen
2021-05-18 18:36                 ` Sean Christopherson
2021-05-18 20:20                 ` Andi Kleen
2021-05-18 20:40                   ` Dave Hansen
2021-05-18 21:05                     ` Andi Kleen
2021-05-18 18:22               ` Sean Christopherson
2021-05-18 20:28                 ` Andi Kleen
2021-05-18 20:37                   ` Sean Christopherson
2021-05-18 20:56                     ` Andi Kleen
2021-05-18 16:18         ` Sean Christopherson
2021-05-18 17:15           ` Andi Kleen
2021-05-18 18:17             ` Sean Christopherson
2021-05-20 22:47               ` Kirill A. Shutemov
2021-06-02 19:42     ` [RFC v2-fix-v2 0/2] " Kuppuswamy Sathyanarayanan
2021-06-02 19:42       ` [RFC v2-fix-v2 1/2] x86/sev-es: Abstract out MMIO instruction decoding Kuppuswamy Sathyanarayanan
2021-06-05 21:56         ` Dan Williams
2021-06-08 15:59           ` [RFC v2-fix-v3 0/4] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
2021-06-08 15:59             ` [RFC v2-fix-v3 1/4] x86/insn-eval: Introduce insn_get_modrm_reg_ptr() Kuppuswamy Sathyanarayanan
2021-06-08 15:59             ` [RFC v2-fix-v3 2/4] x86/insn-eval: Introduce insn_decode_mmio() Kuppuswamy Sathyanarayanan
2021-06-08 15:59             ` [RFC v2-fix-v3 3/4] x86/sev-es: Use insn_decode_mmio() for MMIO implementation Kuppuswamy Sathyanarayanan
2021-06-08 15:59             ` [RFC v2-fix-v3 4/4] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
2021-06-02 19:42       ` [RFC v2-fix-v2 2/2] " Kuppuswamy Sathyanarayanan
2021-06-02 21:01         ` Andi Kleen
2021-06-02 22:14           ` Kuppuswamy, Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Kuppuswamy Sathyanarayanan
2021-05-11  1:23   ` Dan Williams
2021-05-11  2:17     ` Andi Kleen
2021-05-11  2:44       ` Kuppuswamy, Sathyanarayanan
2021-05-11  2:51         ` Andi Kleen
2021-05-11 15:37       ` Dan Williams
2021-05-11 15:42         ` Andi Kleen
2021-05-11 15:44         ` Dave Hansen
2021-05-11 15:50           ` Dan Williams
2021-05-11 15:52             ` Andi Kleen
2021-05-11 16:04               ` Dave Hansen
2021-05-11 17:06                 ` Andi Kleen
2021-05-11 17:42                   ` Dave Hansen
2021-05-11 17:48                     ` Andi Kleen
2021-05-24 23:32                       ` [RFC v2-fix-v2 1/2] x86/tdx: Handle MWAIT and MONITOR Kuppuswamy Sathyanarayanan
2021-05-24 23:32                         ` [RFC v2-fix-v2 2/2] x86/tdx: Ignore WBINVD instruction for TDX guest Kuppuswamy Sathyanarayanan
2021-05-24 23:39                           ` Dan Williams
2021-05-25  0:29                             ` Kuppuswamy, Sathyanarayanan
2021-05-25  0:50                               ` Dan Williams
2021-05-25  0:54                                 ` Sean Christopherson
2021-05-25  1:02                                 ` Andi Kleen
2021-05-25  1:45                                   ` Dan Williams
2021-05-25  2:13                                     ` Andi Kleen
2021-05-25  2:49                                       ` Dan Williams
2021-05-25  3:27                                         ` Andi Kleen
2021-05-25  3:40                                           ` Dan Williams
2021-05-26  1:09                                             ` Andi Kleen
2021-05-27  4:38                                               ` [RFC v2-fix-v3 1/1] " Kuppuswamy Sathyanarayanan
2021-06-05  3:35                                                 ` Dan Williams
2021-06-08 21:35                                                   ` [RFC v2-fix-v3 1/1] x86/tdx: Skip " Kuppuswamy Sathyanarayanan
2021-06-08 21:41                                                     ` Dan Williams
2021-06-08 22:17                                                     ` Dave Hansen
2021-06-08 22:34                                                       ` Andi Kleen
2021-06-08 22:36                                                       ` Kuppuswamy, Sathyanarayanan
2021-06-08 22:53                                                         ` Dave Hansen
2021-06-08 23:04                                                           ` Andi Kleen
2021-06-08 23:04                                                           ` Kuppuswamy, Sathyanarayanan
2021-06-08 23:32                                                     ` Dan Williams
2021-06-08 23:38                                                       ` Dave Hansen
2021-06-09  0:07                                                         ` Dan Williams
2021-06-09  0:14                                                           ` Kuppuswamy, Sathyanarayanan
2021-06-09  1:10                                                           ` [RFC v2-fix-v4 " Kuppuswamy Sathyanarayanan
2021-06-09  3:40                                                             ` Dan Williams
2021-06-09  3:56                                                               ` Kuppuswamy, Sathyanarayanan
2021-06-09  4:19                                                                 ` Dan Williams
2021-06-09  4:27                                                                   ` Andi Kleen
2021-06-09 15:09                                                                     ` Dan Williams
2021-06-09 16:12                                                                       ` Andy Lutomirski
2021-06-09 17:28                                                                         ` Kuppuswamy, Sathyanarayanan
2021-06-09 17:31                                                                           ` Dan Williams
2021-06-09 18:24                                                                             ` Kuppuswamy, Sathyanarayanan
2021-06-09 19:49                                                                               ` [RFC v2-fix-v5 1/1] x86: Skip WBINVD instruction for VM guest Kuppuswamy Sathyanarayanan
2021-06-09 19:56                                                                                 ` Dan Williams
2021-06-09 21:03                                                                                 ` Dave Hansen
2021-06-09 21:38                                                                                   ` Dan Williams
2021-06-09 21:42                                                                                     ` Kuppuswamy, Sathyanarayanan
2021-06-09 23:55                                                                                       ` Dave Hansen
2021-06-09  4:02                                                               ` [RFC v2-fix-v4 1/1] x86/tdx: Skip WBINVD instruction for TDX guest Andy Lutomirski
2021-06-09  4:21                                                                 ` Dan Williams
2021-06-09  4:25                                                                 ` Andi Kleen
2021-06-09  4:32                                                                   ` Andy Lutomirski
2021-06-09  4:40                                                                     ` Andi Kleen
2021-06-09  4:54                                                                       ` Kuppuswamy, Sathyanarayanan
2021-06-09 14:12                                                             ` Dave Hansen
2021-05-25  4:32                                       ` [RFC v2-fix-v2 2/2] x86/tdx: Ignore " Dave Hansen
2021-05-25  0:36                             ` Andi Kleen
2021-05-24 23:42                           ` Dave Hansen
2021-05-25  0:39                             ` Andi Kleen
2021-05-25  0:53                               ` Dan Williams
2021-05-25  2:26                         ` [RFC v2-fix-v2 1/2] x86/tdx: Handle MWAIT and MONITOR Dan Williams
2021-05-11 14:08     ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Dave Hansen
2021-05-11 16:09       ` Sean Christopherson
2021-05-11 16:16         ` Dave Hansen
2021-05-11 15:53   ` Dave Hansen
2021-04-26 18:01 ` [RFC v2 17/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 18/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 19/32] ACPI/table: Print MADT Wake table information Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 20/32] x86/acpi, x86/boot: Add multiprocessor wake-up support Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 21/32] x86/boot: Add a trampoline for APs booting in 64-bit mode Kuppuswamy Sathyanarayanan
2021-05-13  2:56   ` Dan Williams
2021-05-18  0:54     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
2021-05-18  2:06       ` Dan Williams
2021-05-18  2:53         ` Kuppuswamy, Sathyanarayanan
2021-05-18  4:08           ` Dan Williams
2021-05-20  0:18             ` Kuppuswamy, Sathyanarayanan
2021-05-20  0:40               ` Dan Williams
2021-05-20  0:42                 ` Kuppuswamy, Sathyanarayanan
2021-05-21 14:39                   ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
2021-05-21 18:29                     ` Dan Williams
2021-04-26 18:01 ` [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot for TDX platforms Kuppuswamy Sathyanarayanan
2021-05-13  3:03   ` Dan Williams
2021-04-26 18:01 ` [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process Kuppuswamy Sathyanarayanan
2021-05-13  3:23   ` Dan Williams
2021-05-18  0:59     ` [RFC v2-fix 1/1] x86/boot: Avoid #VE during boot for TDX platforms Kuppuswamy Sathyanarayanan
2021-05-19 16:53       ` [RFC " Dave Hansen
2021-05-21 14:35         ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
2021-05-21 16:11           ` Dave Hansen
2021-05-21 18:18             ` Sean Christopherson
2021-05-21 18:30               ` Dave Hansen
2021-05-21 18:32                 ` Kuppuswamy, Sathyanarayanan
2021-05-24 23:27                   ` [RFC v2-fix-v3 " Kuppuswamy Sathyanarayanan
2021-05-27 21:25                     ` [RFC v2-fix-v4 " Kuppuswamy Sathyanarayanan
2021-06-08 23:14                       ` Dan Williams
2021-05-21 18:31             ` [RFC v2-fix-v2 " Kuppuswamy, Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 24/32] x86/topology: Disable CPU online/offline control for TDX guest Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 25/32] x86/tdx: Forcefully disable legacy PIC for TDX guests Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code Kuppuswamy Sathyanarayanan
2021-05-07 21:54   ` Dave Hansen
2021-05-10 22:19     ` Kuppuswamy, Sathyanarayanan
2021-05-10 22:23       ` Dave Hansen
2021-05-12 13:08         ` Kirill A. Shutemov
2021-05-12 15:44           ` Dave Hansen
2021-05-12 15:53             ` Sean Christopherson
2021-05-13 16:40               ` Kuppuswamy, Sathyanarayanan
2021-05-13 17:49                 ` Dave Hansen
2021-05-13 18:17                   ` Kuppuswamy, Sathyanarayanan
2021-05-13 19:38                   ` Andi Kleen
2021-05-13 19:42                     ` Dave Hansen
2021-05-17 18:16                     ` Sean Christopherson
2021-05-17 18:27                       ` Kuppuswamy, Sathyanarayanan
2021-05-17 18:33                         ` Dave Hansen
2021-05-17 18:37                           ` Sean Christopherson
2021-05-17 22:32                             ` Kuppuswamy, Sathyanarayanan
2021-05-17 23:11                               ` Andi Kleen
2021-05-18  1:28             ` Kuppuswamy, Sathyanarayanan
2021-05-27  4:46               ` Kuppuswamy, Sathyanarayanan
2021-05-27  4:47                 ` [RFC v2-fix-v1 1/1] " Kuppuswamy Sathyanarayanan
2021-06-01  2:10                   ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Kuppuswamy Sathyanarayanan
2021-05-19  5:00   ` Kuppuswamy, Sathyanarayanan
2021-05-19 16:14   ` Dave Hansen
2021-05-20 18:48     ` Kuppuswamy, Sathyanarayanan
2021-05-20 18:56       ` Kuppuswamy, Sathyanarayanan
2021-05-20 19:33       ` Sean Christopherson
2021-05-20 19:42         ` Kuppuswamy, Sathyanarayanan
2021-05-20 20:16           ` Sean Christopherson
2021-05-20 20:31             ` Andi Kleen
2021-05-20 21:18               ` Sean Christopherson
2021-05-20 21:23                 ` Dave Hansen
2021-05-20 21:28                   ` Kuppuswamy, Sathyanarayanan
2021-05-20 23:25                     ` Andi Kleen
2021-05-20 20:56             ` Dave Hansen
2021-05-31 21:46               ` Kirill A. Shutemov
2021-06-01  2:08                 ` [RFC v2-fix-v1 1/1] x86/tdx: Exclude Shared bit from physical_mask Kuppuswamy Sathyanarayanan
2021-05-20 20:30       ` [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Dave Hansen
2021-04-26 18:01 ` [RFC v2 28/32] x86/tdx: Make pages shared in ioremap() Kuppuswamy Sathyanarayanan
2021-05-07 21:55   ` Dave Hansen
2021-05-07 22:38     ` Andi Kleen
2021-05-10 22:23       ` Kuppuswamy, Sathyanarayanan
2021-05-10 22:30         ` Dave Hansen
2021-05-10 22:52           ` Sean Christopherson
2021-05-11  9:35             ` Borislav Petkov
2021-05-20 20:12               ` Kuppuswamy, Sathyanarayanan
2021-05-21 15:18                 ` Borislav Petkov
2021-05-21 16:19                   ` Tom Lendacky
2021-05-21 18:49                     ` Borislav Petkov
2021-05-21 21:14                       ` Tom Lendacky
2021-05-25 18:21                         ` Kuppuswamy, Sathyanarayanan
2021-05-31 15:13                           ` Borislav Petkov
2021-05-31 17:32                             ` Kuppuswamy, Sathyanarayanan
2021-05-31 17:55                               ` Borislav Petkov
2021-05-31 18:45                                 ` Kuppuswamy, Sathyanarayanan
2021-05-31 19:14                                   ` Borislav Petkov
2021-06-01  2:07                                     ` [RFC v2-fix-v1 1/1] " Kuppuswamy Sathyanarayanan
2021-06-01 21:16                                     ` [RFC v2 28/32] " Kuppuswamy, Sathyanarayanan
2021-05-26 21:37                     ` Kuppuswamy, Sathyanarayanan
2021-05-26 22:02                       ` Tom Lendacky
2021-05-26 22:14                         ` Tom Lendacky
2021-05-26 22:20                           ` Kuppuswamy, Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 29/32] x86/tdx: Add helper to do MapGPA TDVMCALL Kuppuswamy Sathyanarayanan
2021-05-19 15:59   ` Dave Hansen
2021-05-20 23:14     ` Kuppuswamy, Sathyanarayanan
2021-05-27  4:56       ` [RFC v2-fix-v1 1/1] x86/tdx: Add helper to do MapGPA hypercall Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 30/32] x86/tdx: Make DMA pages shared Kuppuswamy Sathyanarayanan
2021-05-18  1:19   ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
2021-05-18 19:55     ` Sean Christopherson
2021-05-18 22:12       ` Kuppuswamy, Sathyanarayanan
2021-05-18 22:31         ` Dave Hansen
2021-06-01  2:06           ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 31/32] x86/kvm: Use bounce buffers for TD guest Kuppuswamy Sathyanarayanan
2021-06-01  2:03   ` [RFC v2-fix-v1 1/1] " Kuppuswamy Sathyanarayanan
2021-04-26 18:01 ` [RFC v2 32/32] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kuppuswamy Sathyanarayanan
2021-05-07 23:06   ` Dave Hansen
2021-05-24 23:29     ` [RFC v2-fix-v2 1/1] " Kuppuswamy Sathyanarayanan
2021-06-01  1:28       ` [RFC v2-fix-v3 " Kuppuswamy Sathyanarayanan
2021-05-03 23:21 ` [RFC v2 00/32] Add TDX Guest Support Kuppuswamy, Sathyanarayanan
