* [RFC v2 00/32] Add TDX Guest Support
@ 2021-04-26 18:01 Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL Kuppuswamy Sathyanarayanan
                   ` (32 more replies)
  0 siblings, 33 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Hi All,

NOTE: This series is not ready for wide public review. It is being
specifically posted so that Peter Z and other experts on the entry
code can look for problems with the new exception handler (#VE).
That's also why x86@ is not being spammed.

Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. This series adds the bare-minimum
support to run a TDX guest. The host-side support will be submitted
separately, as will support for advanced TD guest features like
attestation and debug mode. Note that at this point the guest is not
yet secure: there are known holes in drivers, and the code has not been
fully audited or fuzzed.

TDX has a lot of similarities to SEV. It enhances confidentiality of
guest memory and state (like registers) and includes a new exception
(#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
yet), TDX limits the host's ability to effect changes in the guest
physical address space.

In contrast to the SEV code in the kernel, TDX guest memory is integrity
protected and isolated; the host is prevented from accessing guest
memory (even ciphertext).

The TDX architecture also includes a new CPU mode called
Secure-Arbitration Mode (SEAM). The software (TDX module) running in this
mode arbitrates interactions between host and guest and implements many of
the guarantees of the TDX architecture.

Some of the key differences between a TD and a regular VM are:

1. Multi CPU bring-up is done using the ACPI MADT wake-up table.
2. A new #VE exception handler is added. The TDX module injects a #VE exception
   into the guest TD for instructions that need to be emulated, disallowed
   MSR accesses, a subset of CPUID leaves, etc.
3. By default memory is marked as private, and the TD selectively shares it
   with the VMM as needed.
4. Remote attestation is supported to enable a third party (either the owner of
   the workload or a user of the services provided by the workload) to establish
   that the workload is running on an Intel-TDX-enabled platform located within a
   TD prior to providing that workload data.

You can find TDX-related documents at the following link.

https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html

Changes since v1:
 * Implemented tdcall() and tdvmcall() helper functions in assembly and renamed
   them as __tdcall() and __tdvmcall().
 * Added do_general_protection() helper function to re-use protection
   code between #GP exception and TDX #VE exception handlers.
 * Addressed syscall gap issue in #VE handler support (for details check
   the commit log in "x86/traps: Add #VE support for TDX guest").
 * Modified patch titled "x86/tdx: Handle port I/O" to re-use common
   tdvmcall() helper function.
 * Added error handling support to MADT CPU wakeup code.
 * Introduced enum tdx_map_type to identify SHARED vs PRIVATE memory type.
 * Enabled shared memory in IOAPIC driver.
 * Added BINUTILS version info for TDCALL.
 * Changed the TDVMCALL vendor id from 0 to "TDX.KVM".
 * Replaced WARN() with pr_warn_ratelimited() in __tdvmcall() wrappers.
 * Addressed review comments on commit logs and code comments.
 * Renamed the patch titled "x86/topology: Disable CPU hotplug support for TDX
   platforms" to "x86/topology: Disable CPU online/offline control for
   TDX guest".
 * Rebased on top of v5.12 kernel.


Erik Kaneda (1):
  ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure

Isaku Yamahata (1):
  x86/tdx: ioapic: Add shared bit for IOAPIC base address

Kirill A. Shutemov (16):
  x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  x86/tdx: Get TD execution environment information via TDINFO
  x86/traps: Add #VE support for TDX guest
  x86/tdx: Add HLT support for TDX guest
  x86/tdx: Wire up KVM hypercalls
  x86/tdx: Add MSR support for TDX guest
  x86/tdx: Handle CPUID via #VE
  x86/io: Allow to override inX() and outX() implementation
  x86/tdx: Handle port I/O
  x86/tdx: Handle in-kernel MMIO
  x86/mm: Move force_dma_unencrypted() to common code
  x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  x86/tdx: Make pages shared in ioremap()
  x86/tdx: Add helper to do MapGPA TDVMALL
  x86/tdx: Make DMA pages shared
  x86/kvm: Use bounce buffers for TD guest

Kuppuswamy Sathyanarayanan (10):
  x86/tdx: Introduce INTEL_TDX_GUEST config option
  x86/cpufeatures: Add TDX Guest CPU feature
  x86/x86: Add is_tdx_guest() interface
  x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  x86/traps: Add do_general_protection() helper function
  x86/tdx: Handle MWAIT, MONITOR and WBINVD
  ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure
  ACPI/table: Print MADT Wake table information
  x86/acpi, x86/boot: Add multiprocessor wake-up support
  x86/topology: Disable CPU online/offline control for TDX guest

Sean Christopherson (4):
  x86/boot: Add a trampoline for APs booting in 64-bit mode
  x86/boot: Avoid #VE during compressed boot for TDX platforms
  x86/boot: Avoid unnecessary #VE during boot process
  x86/tdx: Forcefully disable legacy PIC for TDX guests

 arch/x86/Kconfig                         |  28 +-
 arch/x86/boot/compressed/Makefile        |   2 +
 arch/x86/boot/compressed/head_64.S       |  10 +-
 arch/x86/boot/compressed/misc.h          |   1 +
 arch/x86/boot/compressed/pgtable.h       |   2 +-
 arch/x86/boot/compressed/tdcall.S        |   9 +
 arch/x86/boot/compressed/tdx.c           |  32 ++
 arch/x86/include/asm/apic.h              |   3 +
 arch/x86/include/asm/cpufeatures.h       |   1 +
 arch/x86/include/asm/idtentry.h          |   4 +
 arch/x86/include/asm/io.h                |  24 +-
 arch/x86/include/asm/irqflags.h          |  38 +-
 arch/x86/include/asm/kvm_para.h          |  21 +
 arch/x86/include/asm/paravirt.h          |  22 +-
 arch/x86/include/asm/paravirt_types.h    |   3 +-
 arch/x86/include/asm/pgtable.h           |   3 +
 arch/x86/include/asm/realmode.h          |   1 +
 arch/x86/include/asm/tdx.h               | 176 +++++++++
 arch/x86/kernel/Makefile                 |   1 +
 arch/x86/kernel/acpi/boot.c              |  79 ++++
 arch/x86/kernel/apic/apic.c              |   8 +
 arch/x86/kernel/apic/io_apic.c           |  12 +-
 arch/x86/kernel/asm-offsets.c            |  22 ++
 arch/x86/kernel/head64.c                 |   3 +
 arch/x86/kernel/head_64.S                |  13 +-
 arch/x86/kernel/idt.c                    |   6 +
 arch/x86/kernel/paravirt.c               |   4 +-
 arch/x86/kernel/pci-swiotlb.c            |   2 +-
 arch/x86/kernel/smpboot.c                |   5 +
 arch/x86/kernel/tdcall.S                 | 361 +++++++++++++++++
 arch/x86/kernel/tdx-kvm.c                |  45 +++
 arch/x86/kernel/tdx.c                    | 480 +++++++++++++++++++++++
 arch/x86/kernel/topology.c               |   3 +-
 arch/x86/kernel/traps.c                  |  81 ++--
 arch/x86/mm/Makefile                     |   2 +
 arch/x86/mm/ioremap.c                    |   8 +-
 arch/x86/mm/mem_encrypt.c                |  75 ----
 arch/x86/mm/mem_encrypt_common.c         |  85 ++++
 arch/x86/mm/mem_encrypt_identity.c       |   1 +
 arch/x86/mm/pat/set_memory.c             |  48 ++-
 arch/x86/realmode/rm/header.S            |   1 +
 arch/x86/realmode/rm/trampoline_64.S     |  49 ++-
 arch/x86/realmode/rm/trampoline_common.S |   5 +-
 drivers/acpi/tables.c                    |  11 +
 include/acpi/actbl2.h                    |  26 +-
 45 files changed, 1654 insertions(+), 162 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdcall.S
 create mode 100644 arch/x86/boot/compressed/tdx.c
 create mode 100644 arch/x86/include/asm/tdx.h
 create mode 100644 arch/x86/kernel/tdcall.S
 create mode 100644 arch/x86/kernel/tdx-kvm.c
 create mode 100644 arch/x86/kernel/tdx.c
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

-- 
2.25.1


* [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-27 17:31   ` Borislav Petkov
  2021-04-26 18:01 ` [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option Kuppuswamy Sathyanarayanan
                   ` (31 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Split off halt paravirt calls from CONFIG_PARAVIRT_XXL into
a separate config option. It provides a middle ground for
not-so-deep paravirtualized environments.

CONFIG_PARAVIRT_XL will be used by TDX, which needs a couple of paravirt
calls that were hidden under CONFIG_PARAVIRT_XXL; the rest of that
config would be bloat for TDX.
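
As a rough illustration of the layering (not the kernel code itself),
the userspace sketch below keeps the halt hook under the XL option and
the irq-flag hook under XXL. All demo_* names and the DEMO_CONFIG_*
macros are hypothetical stand-ins for the real pv_ops machinery.

```c
/* Hypothetical stand-ins for the Kconfig symbols: XL on, XXL off. */
#define DEMO_CONFIG_PARAVIRT_XL 1
/* #define DEMO_CONFIG_PARAVIRT_XXL 1 */

static int demo_native_halt_calls;
static int demo_guest_halt_calls;

/* "Native" implementation used when nothing overrides the hook. */
static void demo_native_halt(void)
{
	demo_native_halt_calls++;
}

/* Replacement a guest type (e.g. a TD) could install instead. */
static void demo_guest_halt(void)
{
	demo_guest_halt_calls++;
}

struct demo_pv_irq_ops {
#ifdef DEMO_CONFIG_PARAVIRT_XXL
	unsigned long (*save_fl)(void);	/* only the deep config carries this */
#endif
#ifdef DEMO_CONFIG_PARAVIRT_XL
	void (*halt)(void);		/* available in the middle-ground config */
#endif
};

static struct demo_pv_irq_ops demo_pv_irq = {
#ifdef DEMO_CONFIG_PARAVIRT_XL
	.halt = demo_native_halt,
#endif
};

/* Callers use this wrapper; it is indirect only when XL is enabled. */
static inline void demo_halt(void)
{
#ifdef DEMO_CONFIG_PARAVIRT_XL
	demo_pv_irq.halt();
#else
	demo_native_halt();
#endif
}
```

With only the XL macro defined, demo_halt() goes through the ops table
and can be redirected by assigning demo_pv_irq.halt, while the irq-flag
hook is compiled out entirely.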

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/Kconfig                      |  4 +++
 arch/x86/boot/compressed/misc.h       |  1 +
 arch/x86/include/asm/irqflags.h       | 38 +++++++++++++++------------
 arch/x86/include/asm/paravirt.h       | 22 +++++++++-------
 arch/x86/include/asm/paravirt_types.h |  3 ++-
 arch/x86/kernel/paravirt.c            |  4 ++-
 arch/x86/mm/mem_encrypt_identity.c    |  1 +
 7 files changed, 44 insertions(+), 29 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..6b4b682af468 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -783,8 +783,12 @@ config PARAVIRT
 	  over full virtualization.  However, when run without a hypervisor
 	  the kernel is theoretically slower and slightly larger.
 
+config PARAVIRT_XL
+	bool
+
 config PARAVIRT_XXL
 	bool
+	select PARAVIRT_XL
 
 config PARAVIRT_DEBUG
 	bool "paravirt-ops debugging"
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 901ea5ebec22..4b84abe43765 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -9,6 +9,7 @@
  * paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
+#undef CONFIG_PARAVIRT_XL
 #undef CONFIG_PARAVIRT_XXL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 #undef CONFIG_KASAN
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 144d70ea4393..1688841893d7 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -59,27 +59,11 @@ static inline __cpuidle void native_halt(void)
 
 #endif
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_XL
 #include <asm/paravirt.h>
 #else
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
-
-static __always_inline unsigned long arch_local_save_flags(void)
-{
-	return native_save_fl();
-}
-
-static __always_inline void arch_local_irq_disable(void)
-{
-	native_irq_disable();
-}
-
-static __always_inline void arch_local_irq_enable(void)
-{
-	native_irq_enable();
-}
-
 /*
  * Used in the idle loop; sti takes one instruction cycle
  * to complete:
@@ -97,6 +81,26 @@ static inline __cpuidle void halt(void)
 {
 	native_halt();
 }
+#endif /* !__ASSEMBLY__ */
+#endif /* CONFIG_PARAVIRT_XL */
+
+#ifndef CONFIG_PARAVIRT_XXL
+#ifndef __ASSEMBLY__
+
+static __always_inline unsigned long arch_local_save_flags(void)
+{
+	return native_save_fl();
+}
+
+static __always_inline void arch_local_irq_disable(void)
+{
+	native_irq_disable();
+}
+
+static __always_inline void arch_local_irq_enable(void)
+{
+	native_irq_enable();
+}
 
 /*
  * For spinlocks, etc:
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110e2243..2dbb6c9c7e98 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,18 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 	PVOP_VCALL1(mmu.exit_mmap, mm);
 }
 
+#ifdef CONFIG_PARAVIRT_XL
+static inline void arch_safe_halt(void)
+{
+	PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+	PVOP_VCALL0(irq.halt);
+}
+#endif
+
 #ifdef CONFIG_PARAVIRT_XXL
 static inline void load_sp0(unsigned long sp0)
 {
@@ -145,16 +157,6 @@ static inline void __write_cr4(unsigned long x)
 	PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-static inline void arch_safe_halt(void)
-{
-	PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
-	PVOP_VCALL0(irq.halt);
-}
-
 static inline void wbinvd(void)
 {
 	PVOP_VCALL0(cpu.wbinvd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index de87087d3bde..5261fba47ba5 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -177,7 +177,8 @@ struct pv_irq_ops {
 	struct paravirt_callee_save save_fl;
 	struct paravirt_callee_save irq_disable;
 	struct paravirt_callee_save irq_enable;
-
+#endif
+#ifdef CONFIG_PARAVIRT_XL
 	void (*safe_halt)(void);
 	void (*halt)(void);
 #endif
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c60222ab8ab9..d6d0b363fe70 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -322,9 +322,11 @@ struct paravirt_patch_template pv_ops = {
 	.irq.save_fl		= __PV_IS_CALLEE_SAVE(native_save_fl),
 	.irq.irq_disable	= __PV_IS_CALLEE_SAVE(native_irq_disable),
 	.irq.irq_enable		= __PV_IS_CALLEE_SAVE(native_irq_enable),
+#endif /* CONFIG_PARAVIRT_XXL */
+#ifdef CONFIG_PARAVIRT_XL
 	.irq.safe_halt		= native_safe_halt,
 	.irq.halt		= native_halt,
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_XL */
 
 	/* Mmu ops. */
 	.mmu.flush_tlb_user	= native_flush_tlb_local,
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 6c5eb6f3f14f..20d0cb116557 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -24,6 +24,7 @@
  * be extended when new paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
+#undef CONFIG_PARAVIRT_XL
 #undef CONFIG_PARAVIRT_XXL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 
-- 
2.25.1


* [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 21:09   ` Randy Dunlap
  2021-04-26 18:01 ` [RFC v2 03/32] x86/cpufeatures: Add TDX Guest CPU feature Kuppuswamy Sathyanarayanan
                   ` (30 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Add INTEL_TDX_GUEST config option to selectively compile
TDX guest support.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/Kconfig | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6b4b682af468..932e6d759ba7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -875,6 +875,21 @@ config ACRN_GUEST
 	  IOT with small footprint and real-time features. More details can be
 	  found in https://projectacrn.org/.
 
+config INTEL_TDX_GUEST
+	bool "Intel Trusted Domain eXtensions Guest Support"
+	depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
+	depends on SECURITY
+	select PARAVIRT_XL
+	select X86_X2APIC
+	select SECURITY_LOCKDOWN_LSM
+	help
+	  Provide support for running in a trusted domain on Intel processors
+	  equipped with Trusted Domain eXtensions. TDX is a new Intel
+	  technology that extends VMX and Memory Encryption with a new kind of
+	  virtual machine guest called Trust Domain (TD). A TD is designed to
+	  run in a CPU mode that protects the confidentiality of TD memory
+	  contents and the TD’s CPU state from other software, including VMM.
+
 endif #HYPERVISOR_GUEST
 
 source "arch/x86/Kconfig.cpu"
-- 
2.25.1


* [RFC v2 03/32] x86/cpufeatures: Add TDX Guest CPU feature
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 04/32] x86/x86: Add is_tdx_guest() interface Kuppuswamy Sathyanarayanan
                   ` (29 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Add CPU feature detection for Trusted Domain Extensions support.
The TDX feature adds capabilities to keep guest register state and
memory isolated from the hypervisor.

For TDX guest platforms, executing CPUID(0x21, 0) will return the
following values in EAX, EBX, ECX and EDX.

EAX:  Maximum sub-leaf number:  0
EBX/EDX/ECX:  Vendor string:

EBX =  "Inte"
EDX =  "lTDX"
ECX =  "    "

So when the above condition is true, set the X86_FEATURE_TDX_GUEST
feature cap bit.
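
To make the register order concrete, here is a small userspace sketch
(not the kernel code) that assembles the 12-byte vendor string from
EBX, EDX and ECX, in that order, and compares it against the expected
signature. The helper names are made up for illustration, and the
register values are built from ASCII rather than read with CPUID.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_TDX_SIGNATURE "IntelTDX    "	/* 12 bytes, space padded */

/* Build a register value whose in-memory bytes are the four characters. */
static uint32_t demo_reg_from_ascii(const char s[4])
{
	uint32_t v;

	memcpy(&v, s, sizeof(v));
	return v;
}

/*
 * The vendor string is spread over EBX, EDX, ECX (in that order), so
 * the registers must be laid out in memory in that order before the
 * comparison.
 */
static bool demo_regs_match_tdx(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
	uint32_t sig[3] = { ebx, edx, ecx };

	return memcmp(DEMO_TDX_SIGNATURE, sig, sizeof(sig)) == 0;
}
```

Passing the values listed above ("Inte" in EBX, "lTDX" in EDX, four
spaces in ECX) matches; laying the registers out in EBX, ECX, EDX order
instead would spell "Inte    lTDX" and fail the comparison.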

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/tdx.h         | 20 ++++++++++++++++++++
 arch/x86/kernel/Makefile           |  1 +
 arch/x86/kernel/head64.c           |  3 +++
 arch/x86/kernel/tdx.c              | 30 ++++++++++++++++++++++++++++++
 5 files changed, 55 insertions(+)
 create mode 100644 arch/x86/include/asm/tdx.h
 create mode 100644 arch/x86/kernel/tdx.c

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index cc96e26d69f7..d883df70c27b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -236,6 +236,7 @@
 #define X86_FEATURE_EPT_AD		( 8*32+17) /* Intel Extended Page Table access-dirty bit */
 #define X86_FEATURE_VMCALL		( 8*32+18) /* "" Hypervisor supports the VMCALL instruction */
 #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
+#define X86_FEATURE_TDX_GUEST		( 8*32+20) /* Trusted Domain Extensions Guest */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
new file mode 100644
index 000000000000..679500e807f3
--- /dev/null
+++ b/arch/x86/include/asm/tdx.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_TDX_H
+#define _ASM_X86_TDX_H
+
+#define TDX_CPUID_LEAF_ID	0x21
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+#include <asm/cpufeature.h>
+
+void __init tdx_early_init(void);
+
+#else // !CONFIG_INTEL_TDX_GUEST
+
+static inline void tdx_early_init(void) { };
+
+#endif /* CONFIG_INTEL_TDX_GUEST */
+
+#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 2ddf08351f0b..ea111bf50691 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,6 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 5e9beb77cafd..75f2401cb5db 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -40,6 +40,7 @@
 #include <asm/extable.h>
 #include <asm/trapnr.h>
 #include <asm/sev-es.h>
+#include <asm/tdx.h>
 
 /*
  * Manage page tables very early on.
@@ -491,6 +492,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	kasan_early_init();
 
+	tdx_early_init();
+
 	idt_setup_early_handler();
 
 	copy_bootdata(__va(real_mode_data));
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
new file mode 100644
index 000000000000..f927e36769d5
--- /dev/null
+++ b/arch/x86/kernel/tdx.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2020 Intel Corporation */
+
+#include <asm/tdx.h>
+
+static inline bool cpuid_has_tdx_guest(void)
+{
+	u32 eax, signature[3];
+
+	if (cpuid_eax(0) < TDX_CPUID_LEAF_ID)
+		return false;
+
+	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &signature[0],
+			&signature[2], &signature[1]);
+
+	if (memcmp("IntelTDX    ", signature, 12))
+		return false;
+
+	return true;
+}
+
+void __init tdx_early_init(void)
+{
+	if (!cpuid_has_tdx_guest())
+		return;
+
+	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+	pr_info("TDX guest is initialized\n");
+}
-- 
2.25.1


* [RFC v2 04/32] x86/x86: Add is_tdx_guest() interface
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (2 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 03/32] x86/cpufeatures: Add TDX Guest CPU feature Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions Kuppuswamy Sathyanarayanan
                   ` (28 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan, Sean Christopherson

Add a helper function to detect TDX feature support. It will be used
to guard TDX-specific code.
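
The decompression-stage variant added by this patch probes once and
caches the answer in a tri-state variable. A minimal userspace sketch
of that pattern follows; demo_probe_tdx() is a hypothetical stand-in
for the CPUID check, and its result is hard-coded for illustration.

```c
#include <stdbool.h>

static int demo_probe_calls;			/* counts how often we probe */
static bool demo_platform_is_tdx = true;	/* pretend probe result */

/* Stand-in for the (comparatively expensive) CPUID signature check. */
static bool demo_probe_tdx(void)
{
	demo_probe_calls++;
	return demo_platform_is_tdx;
}

/* -1: not probed yet, 0: not a TDX guest, 1: TDX guest */
static int demo_tdx_guest = -1;

static bool demo_is_tdx_guest(void)
{
	if (demo_tdx_guest < 0)
		demo_tdx_guest = demo_probe_tdx();

	return demo_tdx_guest != 0;
}
```

Repeated calls to demo_is_tdx_guest() run the probe only the first
time. The main-kernel variant in this patch does not need such a cache
because it tests the X86_FEATURE_TDX_GUEST bit instead.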

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile |  1 +
 arch/x86/boot/compressed/tdx.c    | 32 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/tdx.h        |  8 ++++++++
 arch/x86/kernel/tdx.c             |  6 ++++++
 4 files changed, 47 insertions(+)
 create mode 100644 arch/x86/boot/compressed/tdx.c

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index e0bc3988c3fa..a2554621cefe 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -96,6 +96,7 @@ ifdef CONFIG_X86_64
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
new file mode 100644
index 000000000000..0a87c1775b67
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tdx.c - Early boot code for TDX
+ */
+
+#include <asm/tdx.h>
+
+static int __ro_after_init tdx_guest = -1;
+
+static inline bool native_cpuid_has_tdx_guest(void)
+{
+	u32 eax = TDX_CPUID_LEAF_ID, signature[3] = {0};
+
+	if (native_cpuid_eax(0) < TDX_CPUID_LEAF_ID)
+		return false;
+
+	native_cpuid(&eax, &signature[0], &signature[2], &signature[1]);
+
+	if (memcmp("IntelTDX    ", signature, 12))
+		return false;
+
+	return true;
+}
+
+bool is_tdx_guest(void)
+{
+	if (tdx_guest < 0)
+		tdx_guest = native_cpuid_has_tdx_guest();
+
+	return !!tdx_guest;
+}
+
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 679500e807f3..69af72d08d3d 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -9,10 +9,18 @@
 
 #include <asm/cpufeature.h>
 
+/* Common API to check TDX support in decompression and common kernel code. */
+bool is_tdx_guest(void);
+
 void __init tdx_early_init(void);
 
 #else // !CONFIG_INTEL_TDX_GUEST
 
+static inline bool is_tdx_guest(void)
+{
+	return false;
+}
+
 static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index f927e36769d5..6a7193fead08 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -19,6 +19,12 @@ static inline bool cpuid_has_tdx_guest(void)
 	return true;
 }
 
+bool is_tdx_guest(void)
+{
+	return static_cpu_has(X86_FEATURE_TDX_GUEST);
+}
+EXPORT_SYMBOL_GPL(is_tdx_guest);
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
-- 
2.25.1


* [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (3 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 04/32] x86/x86: Add is_tdx_guest() interface Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 20:32   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 06/32] x86/tdx: Get TD execution environment information via TDINFO Kuppuswamy Sathyanarayanan
                   ` (27 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Guests communicate with VMMs with hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
expose guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called "tdcall".

In a TDX-based VM, since the VMM is an untrusted entity, an intermediary
layer (the TDX module) exists between host and guest to facilitate
secure communication. The guest uses the "tdcall" instruction to
request services from the TDX module. A variant of the "tdcall"
instruction (with specific arguments as defined by the GHCI) is used by
the guest to request services from the VMM via the TDX module.

Implement common helper functions to communicate with the TDX module
and the VMM (using the TDCALL instruction).

__tdvmcall() - function can be used to request services from the VMM.

__tdcall()   - function can be used to communicate with the TDX module.

Also define two additional wrappers, tdvmcall() and tdvmcall_out_r11(),
to cover common use cases of the __tdvmcall() function. Since each use
case of __tdcall() is different, we don't need such wrappers for it.

Implement the __tdcall() and __tdvmcall() helper functions in assembly.
The rationale for choosing standalone assembly over inline assembly is:

1. Since the __tdvmcall() implementation is over 70 lines of
instructions (with comments), implementing it in inline assembly
would make it hard to read.

2. Since many registers (R8-R15, R[A-D]X) are used in
TDVMCALL/TDCALL operations, listing all of them in inline assembly
constraints may exceed what some older compilers can satisfy.

Also, just like syscalls, not all TDVMCALL/TDCALL use cases need to
use the same set of argument registers. The implementation here picks
the current worst-case scenario for TDCALL (4 registers). For TDCALLs
with fewer than 4 arguments, there will end up being a few superfluous
(cheap) instructions. But this approach maximizes code reuse. The
same argument applies to the __tdvmcall() function as well.
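
The fixed-arity idea can be sketched in plain C as below. This is only
an illustration: demo_raw_tdvmcall() is a mock standing in for the
assembly routine (it issues no TDCALL), and the wrapper names and
signatures are hypothetical rather than those of the tdvmcall() and
tdvmcall_out_r11() wrappers in this series.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_tdvmcall_output {
	uint64_t r11, r12, r13, r14, r15;
};

/*
 * Mock of the worst-case helper: always takes four argument registers.
 * It "returns" the sum of its inputs in r11 so callers can be checked.
 */
static uint64_t demo_raw_tdvmcall(uint64_t fn, uint64_t r12, uint64_t r13,
				  uint64_t r14, uint64_t r15,
				  struct demo_tdvmcall_output *out)
{
	if (out) {
		out->r11 = fn + r12 + r13 + r14 + r15;
		out->r12 = out->r13 = out->r14 = out->r15 = 0;
	}

	return 0;	/* 0 means success in this mock */
}

/* Caller that does not care about output registers. */
static uint64_t demo_tdvmcall(uint64_t fn, uint64_t r12, uint64_t r13,
			      uint64_t r14, uint64_t r15)
{
	return demo_raw_tdvmcall(fn, r12, r13, r14, r15, NULL);
}

/* One-argument request: the unused registers are simply passed as 0. */
static uint64_t demo_tdvmcall_1(uint64_t fn, uint64_t arg, uint64_t *result)
{
	struct demo_tdvmcall_output out = { 0 };
	uint64_t err = demo_raw_tdvmcall(fn, arg, 0, 0, 0, &out);

	if (!err)
		*result = out.r11;

	return err;
}
```

The cost of the one-argument case is three extra zero loads, which is
the "few superfluous (cheap) instructions" trade-off described above.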

The current implementation of __tdvmcall() includes error handling (ud2
on failure) in the assembly function instead of doing it in a C wrapper
function. The reason behind this choice is that, when adding support for
in/out instructions (refer to the patch titled "x86/tdx: Handle port I/O"
in this series), we use alternative_io() to substitute the in/out
instructions with __tdvmcall() calls. Using C wrappers is not trivial
in this case because the input parameters will be in the wrong registers
and it's tricky to include proper buffer code to make this happen.

For the registers used by the TDCALL instruction, see the TDX GHCI
specification, sections 2.4 and 3.

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Originally-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/tdx.h    |  26 +++++
 arch/x86/kernel/Makefile      |   2 +-
 arch/x86/kernel/asm-offsets.c |  22 ++++
 arch/x86/kernel/tdcall.S      | 200 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c         |  36 ++++++
 5 files changed, 285 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..6c3c71bb57a0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,38 @@
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
+#include <linux/types.h>
+
+struct tdcall_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+struct tdvmcall_output {
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
 void __init tdx_early_init(void);
 
+/* Helper function used to communicate with the TDX module */
+u64 __tdcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+	     struct tdcall_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+	       struct tdvmcall_output *out);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..4a9885a9a28b 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
 #include <xen/interface/xen.h>
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
 #ifdef CONFIG_X86_32
 # include "asm-offsets_32.c"
 #else
@@ -75,6 +79,24 @@ static void __used common(void)
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+	BLANK();
+	/* Offset for fields in tdcall_output */
+	OFFSET(TDCALL_rcx, tdcall_output, rcx);
+	OFFSET(TDCALL_rdx, tdcall_output, rdx);
+	OFFSET(TDCALL_r8,  tdcall_output, r8);
+	OFFSET(TDCALL_r9,  tdcall_output, r9);
+	OFFSET(TDCALL_r10, tdcall_output, r10);
+	OFFSET(TDCALL_r11, tdcall_output, r11);
+
+	/* Offset for fields in tdvmcall_output */
+	OFFSET(TDVMCALL_r11, tdvmcall_output, r11);
+	OFFSET(TDVMCALL_r12, tdvmcall_output, r12);
+	OFFSET(TDVMCALL_r13, tdvmcall_output, r13);
+	OFFSET(TDVMCALL_r14, tdvmcall_output, r14);
+	OFFSET(TDVMCALL_r15, tdvmcall_output, r15);
+#endif
+
 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
 	OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..81af70c2acbd
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,200 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+
+/*
+ * Expose registers R10-R15 to VMM (for bitfield info
+ * refer to TDX GHCI specification).
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK	0xfc00
+
+/*
+ * TDX guests use the TDCALL instruction to make
+ * hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdcall()  - Used to communicate with the TDX module
+ *
+ * @arg1 (RDI) - TDCALL Leaf ID
+ * @arg2 (RSI) - Input parameter 1 passed to TDX module
+ *               via register RCX
+ * @arg3 (RDX) - Input parameter 2 passed to TDX module
+ *               via register RDX
+ * @arg4 (RCX) - Input parameter 3 passed to TDX module
+ *               via register R8
+ * @arg5 (R8)  - Input parameter 4 passed to TDX module
+ *               via register R9
+ * @arg6 (R9)  - struct tdcall_output pointer
+ *
+ * @out        - Return status of tdcall via RAX.
+ *
+ * NOTE: This function should only be used for non-TDVMCALL
+ *       use cases
+ */
+SYM_FUNC_START(__tdcall)
+	FRAME_BEGIN
+
+	/* Save non-volatile GPRs that are exposed to the VMM. */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/* Move TDCALL Leaf ID to RAX */
+	mov %rdi, %rax
+	/* Move output pointer to R12 */
+	mov %r9, %r12
+	/* Move input param 4 to R9 */
+	mov %r8, %r9
+	/* Move input param 3 to R8 */
+	mov %rcx, %r8
+	/* Leave input param 2 in RDX */
+	/* Move input param 1 to RCX */
+	mov %rsi, %rcx
+
+	tdcall
+
+	/* Check for TDCALL success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check for a TDCALL output struct */
+	test %r12, %r12
+	jz 1f
+
+	/* Copy TDCALL result registers to output struct: */
+	movq %rcx, TDCALL_rcx(%r12)
+	movq %rdx, TDCALL_rdx(%r12)
+	movq %r8,  TDCALL_r8(%r12)
+	movq %r9,  TDCALL_r9(%r12)
+	movq %r10, TDCALL_r10(%r12)
+	movq %r11, TDCALL_r11(%r12)
+1:
+	/* Zero out registers exposed to the TDX Module. */
+	xor %rcx,  %rcx
+	xor %rdx,  %rdx
+	xor %r8d,  %r8d
+	xor %r9d,  %r9d
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+
+	/* Restore non-volatile GPRs that are exposed to the VMM. */
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	FRAME_END
+	ret
+SYM_FUNC_END(__tdcall)
+
+/*
+ * do_tdvmcall()  - Used to communicate with the VMM.
+ *
+ * @arg1 (RDI)    - TDVMCALL function, e.g. exit reason
+ * @arg2 (RSI)    - Input parameter 1 passed to VMM
+ *                  via register R12
+ * @arg3 (RDX)    - Input parameter 2 passed to VMM
+ *                  via register R13
+ * @arg4 (RCX)    - Input parameter 3 passed to VMM
+ *                  via register R14
+ * @arg5 (R8)     - Input parameter 4 passed to VMM
+ *                  via register R15
+ * @arg6 (R9)     - struct tdvmcall_output pointer
+ *
+ * @out           - Return status of tdvmcall(R10) via RAX.
+ *
+ */
+SYM_CODE_START_LOCAL(do_tdvmcall)
+	FRAME_BEGIN
+
+	/* Save non-volatile GPRs that are exposed to the VMM. */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/* Set TDCALL leaf ID to TDVMCALL (0) in RAX */
+	xor %eax, %eax
+	/* Move TDVMCALL function id (1st argument) to R11 */
+	mov %rdi, %r11
+	/* Move Input parameter 1-4 to R12-R15 */
+	mov %rsi, %r12
+	mov %rdx, %r13
+	mov %rcx, %r14
+	mov %r8,  %r15
+	/* Leave tdvmcall output pointer in R9 */
+
+	/*
+	 * Value of RCX is used by the TDX Module to determine which
+	 * registers are exposed to VMM. Each bit in RCX represents a
+	 * register id. You can find the bitmap details from TDX GHCI
+	 * spec.
+	 */
+	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+	tdcall
+
+	/*
+	 * Check for TDCALL success: 0 - Successful, otherwise failed.
+	 * If failed, there is an issue with TDX Module which is fatal
+	 * for the guest. So panic.
+	 */
+	test %rax, %rax
+	jnz 2f
+
+	/* Move TDVMCALL success/failure to RAX to return to the caller */
+	mov %r10, %rax
+
+	/* Check for TDVMCALL success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check for a TDVMCALL output struct */
+	test %r9, %r9
+	jz 1f
+
+	/* Copy TDVMCALL result registers to output struct: */
+	movq %r11, TDVMCALL_r11(%r9)
+	movq %r12, TDVMCALL_r12(%r9)
+	movq %r13, TDVMCALL_r13(%r9)
+	movq %r14, TDVMCALL_r14(%r9)
+	movq %r15, TDVMCALL_r15(%r9)
+1:
+	/*
+	 * Zero out registers exposed to the VMM to avoid
+	 * speculative execution with VMM-controlled values.
+	 */
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+	xor %r12d, %r12d
+	xor %r13d, %r13d
+	xor %r14d, %r14d
+	xor %r15d, %r15d
+
+	/* Restore non-volatile GPRs that are exposed to the VMM. */
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	FRAME_END
+	ret
+2:
+	ud2
+SYM_CODE_END(do_tdvmcall)
+
+/* Helper function for standard type of TDVMCALL */
+SYM_FUNC_START(__tdvmcall)
+	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+	xor %r10, %r10
+	call do_tdvmcall
+	retq
+SYM_FUNC_END(__tdvmcall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 6a7193fead08..29c52128b9c0 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -1,8 +1,44 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (C) 2020 Intel Corporation */
 
+#define pr_fmt(fmt) "TDX: " fmt
+
 #include <asm/tdx.h>
 
+/*
+ * Wrapper for use case that checks for error code and print warning message.
+ */
+static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	u64 err;
+
+	err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return err;
+}
+
+/*
+ * Wrapper for the semi-common case where we need single output value (R11).
+ */
+static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+
+	struct tdvmcall_output out = {0};
+	u64 err;
+
+	err = __tdvmcall(fn, r12, r13, r14, r15, &out);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return out.r11;
+}
+
 static inline bool cpuid_has_tdx_guest(void)
 {
 	u32 eax, signature[3];
-- 
2.25.1



* [RFC v2 06/32] x86/tdx: Get TD execution environment information via TDINFO
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (4 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 07/32] x86/traps: Add do_general_protection() helper function Kuppuswamy Sathyanarayanan
                   ` (26 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Per Guest-Host-Communication Interface (GHCI) for Intel Trust
Domain Extensions (Intel TDX) specification, sec 2.4.2,
TDCALL[TDINFO] provides basic TD execution environment information, not
provided by CPUID.

Call TDINFO during early boot so that its output is available to later
system initialization.

The call provides info on which bit in pfn is used to indicate that the
page is shared with the host and attributes of the TD, such as debug.
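
A minimal sketch of how the TDINFO output might be decoded, assuming (as
the code in this patch does) that the low 6 bits of RCX hold the guest
physical address width. Treating the topmost GPA bit as the shared bit
is an assumption made for illustration only; this patch stores the width
and does not compute a mask. The names tdinfo_gpa_width() and
shared_gpa_mask() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Low 6 bits of RCX: guest physical address width, i.e. GENMASK(5, 0). */
static unsigned int tdinfo_gpa_width(uint64_t rcx)
{
	return (unsigned int)(rcx & 0x3f);
}

/* Assumption: the shared bit is the topmost bit of the GPA. */
static uint64_t shared_gpa_mask(unsigned int gpa_width)
{
	assert(gpa_width >= 1 && gpa_width <= 63);
	return UINT64_C(1) << (gpa_width - 1);
}
```

With a reported width of 52, for example, this sketch would place the
shared bit at bit 51.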

We don't save information about the number of CPUs as there are no
users so far.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/tdx.h |  2 ++
 arch/x86/kernel/tdx.c      | 23 +++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 6c3c71bb57a0..c5a870cef0ae 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -10,6 +10,8 @@
 #include <asm/cpufeature.h>
 #include <linux/types.h>
 
+#define TDINFO			1
+
 struct tdcall_output {
 	u64 rcx;
 	u64 rdx;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 29c52128b9c0..b63275db1db9 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -4,6 +4,14 @@
 #define pr_fmt(fmt) "TDX: " fmt
 
 #include <asm/tdx.h>
+#include <asm/vmx.h>
+
+#include <linux/cpu.h>
+
+static struct {
+	unsigned int gpa_width;
+	unsigned long attributes;
+} td_info __ro_after_init;
 
 /*
  * Wrapper for use case that checks for error code and print warning message.
@@ -61,6 +69,19 @@ bool is_tdx_guest(void)
 }
 EXPORT_SYMBOL_GPL(is_tdx_guest);
 
+static void tdg_get_info(void)
+{
+	u64 ret;
+	struct tdcall_output out = {0};
+
+	ret = __tdcall(TDINFO, 0, 0, 0, 0, &out);
+
+	BUG_ON(ret);
+
+	td_info.gpa_width = out.rcx & GENMASK(5, 0);
+	td_info.attributes = out.rdx;
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
@@ -68,5 +89,7 @@ void __init tdx_early_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+	tdg_get_info();
+
 	pr_info("TDX guest is initialized\n");
 }
-- 
2.25.1



* [RFC v2 07/32] x86/traps: Add do_general_protection() helper function
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (5 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 06/32] x86/tdx: Get TD execution environment information via TDINFO Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 21:20   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 08/32] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
                   ` (25 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

The TDX guest #VE exception handler treats unsupported exceptions
as #GP. So that it can reuse the #GP handling, move the protection
fault handler code out of exc_general_protection() into a new helper
function, do_general_protection().

Also, since the exception handler is responsible for deciding when to
turn IRQs on and off, move the cond_local_irq_{enable,disable}() calls
out of do_general_protection().

This is a preparatory patch for adding #VE exception handler
support for TDX guests.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/traps.c | 51 ++++++++++++++++++++++-------------------
 1 file changed, 27 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 651e3e508959..213d4aa8e337 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -527,44 +527,28 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
 
 #define GPFSTR "general protection fault"
 
-DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
+static void do_general_protection(struct pt_regs *regs, long error_code)
 {
 	char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
 	enum kernel_gp_hint hint = GP_NO_HINT;
-	struct task_struct *tsk;
+	struct task_struct *tsk = current;
 	unsigned long gp_addr;
 	int ret;
 
-	cond_local_irq_enable(regs);
-
-	if (static_cpu_has(X86_FEATURE_UMIP)) {
-		if (user_mode(regs) && fixup_umip_exception(regs))
-			goto exit;
-	}
-
-	if (v8086_mode(regs)) {
-		local_irq_enable();
-		handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
-		local_irq_disable();
-		return;
-	}
-
-	tsk = current;
-
 	if (user_mode(regs)) {
 		tsk->thread.error_code = error_code;
 		tsk->thread.trap_nr = X86_TRAP_GP;
 
 		if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
-			goto exit;
+			return;
 
 		show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
 		force_sig(SIGSEGV);
-		goto exit;
+		return;
 	}
 
 	if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
-		goto exit;
+		return;
 
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_nr = X86_TRAP_GP;
@@ -576,11 +560,11 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	if (!preemptible() &&
 	    kprobe_running() &&
 	    kprobe_fault_handler(regs, X86_TRAP_GP))
-		goto exit;
+		return;
 
 	ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);
 	if (ret == NOTIFY_STOP)
-		goto exit;
+		return;
 
 	if (error_code)
 		snprintf(desc, sizeof(desc), "segment-related " GPFSTR);
@@ -601,8 +585,27 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 		gp_addr = 0;
 
 	die_addr(desc, regs, error_code, gp_addr);
+}
 
-exit:
+DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
+{
+	cond_local_irq_enable(regs);
+
+	if (static_cpu_has(X86_FEATURE_UMIP)) {
+		if (user_mode(regs) && fixup_umip_exception(regs)) {
+			cond_local_irq_disable(regs);
+			return;
+		}
+	}
+
+	if (v8086_mode(regs)) {
+		local_irq_enable();
+		handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
+		local_irq_disable();
+		return;
+	}
+
+	do_general_protection(regs, error_code);
 	cond_local_irq_disable(regs);
 }
 
-- 
2.25.1



* [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (6 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 07/32] x86/traps: Add do_general_protection() helper function Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 21:36   ` Dave Hansen
  2021-06-08 17:02   ` [RFC v2 08/32] " Dave Hansen
  2021-04-26 18:01 ` [RFC v2 09/32] x86/tdx: Add HLT " Kuppuswamy Sathyanarayanan
                   ` (24 subsequent siblings)
  32 siblings, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The TDX module injects a #VE exception into the guest TD in cases of
disallowed instructions, disallowed MSR accesses and a subset of CPUID
leaves. The TDX module guarantees that no #VE is injected on an EPT
violation on guest physical addresses that are memory, which avoids any
problems with the “system call gap”. We can still get #VE on MMIO
mappings.

Add basic infrastructure to handle #VE. If there is no handler for a
given #VE, it is an unexpected event (fault case), so treat it as a
general protection fault and handle it by calling
do_general_protection().
   
TDCALL[TDGETVEINFO] provides information about #VE such as exit reason.
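
For illustration, the sketch below shows how the packed R10 output could
be split into instruction length and instruction information, following
the layout used by tdg_get_ve_info() in this patch (low 32 bits: length,
high 32 bits: info). The struct and function names are hypothetical
stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

struct ve_insn_fields {
	uint32_t instr_len;	/* low 32 bits of R10 */
	uint32_t instr_info;	/* high 32 bits of R10 */
};

/* Split the 64-bit R10 value into its two 32-bit halves. */
static void unpack_r10(uint64_t r10, struct ve_insn_fields *out)
{
	assert(out);
	out->instr_len  = (uint32_t)(r10 & UINT32_MAX);
	out->instr_info = (uint32_t)(r10 >> 32);
}
```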

A #VE cannot nest before TDGETVEINFO is called; if one were raised in
that window, the TD would shut down. The TDX module guarantees that no
NMIs (or #MC or similar) can happen in this window. After TDGETVEINFO
the #VE handler can nest if needed, although we don’t expect that to
happen normally.
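
The ordering constraint described above can be modelled in user space
roughly as follows: fetch the #VE information first, only then allow
interrupts, and fall back to the #GP path when no handler claims the
event. Everything here (the irqs_enabled flag, the stub functions, the
return convention) is a hypothetical stand-in for the kernel machinery,
not the actual entry code.

```c
#include <assert.h>
#include <stdbool.h>

static bool irqs_enabled;	/* stand-in for the interrupt flag */
static bool ve_info_consumed;	/* set once the TDGETVEINFO step has run */
static int gp_calls;		/* counts fallbacks to the #GP path */

/* Stand-in for TDGETVEINFO: must run before interrupts are enabled. */
static int get_ve_info(void)
{
	assert(!irqs_enabled);
	ve_info_consumed = true;
	return 0;
}

/* Stand-in for the per-exit-reason handler: nonzero means "not handled". */
static int handle_ve(bool has_handler)
{
	return has_handler ? 0 : -1;
}

static void ve_entry(bool has_handler)
{
	int ret;

	ve_info_consumed = false;
	ret = get_ve_info();

	irqs_enabled = true;	/* the #VE info is already consumed here */
	if (!ret)
		ret = handle_ve(has_handler);
	if (ret)
		gp_calls++;	/* treat the unhandled #VE as #GP(0) */
	irqs_enabled = false;
}
```

The assert() in get_ve_info() is the point of the model: it would fire
if a caller enabled interrupts before consuming the #VE information.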

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/idtentry.h |  4 ++++
 arch/x86/include/asm/tdx.h      | 15 +++++++++++++
 arch/x86/kernel/idt.c           |  6 ++++++
 arch/x86/kernel/tdx.c           | 38 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps.c         | 30 ++++++++++++++++++++++++++
 5 files changed, 93 insertions(+)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5eb3bdf36a41..41a0732d5f68 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -619,6 +619,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER,	exc_xen_hypervisor_callback);
 DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER,	exc_xen_unknown_trap);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE,		exc_virtualization_exception);
+#endif
+
 /* Device interrupts common/spurious */
 DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index c5a870cef0ae..1ca55d8e9963 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -11,6 +11,7 @@
 #include <linux/types.h>
 
 #define TDINFO			1
+#define TDGETVEINFO		3
 
 struct tdcall_output {
 	u64 rcx;
@@ -29,6 +30,20 @@ struct tdvmcall_output {
 	u64 r15;
 };
 
+struct ve_info {
+	u64 exit_reason;
+	u64 exit_qual;
+	u64 gla;
+	u64 gpa;
+	u32 instr_len;
+	u32 instr_info;
+};
+
+unsigned long tdg_get_ve_info(struct ve_info *ve);
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+		struct ve_info *ve);
+
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..546b6b636c7d 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
 	 */
 	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 #endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 };
 
 /*
@@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
 	INTG(X86_TRAP_MF,		asm_exc_coprocessor_error),
 	INTG(X86_TRAP_AC,		asm_exc_alignment_check),
 	INTG(X86_TRAP_XF,		asm_exc_simd_coprocessor_error),
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 
 #ifdef CONFIG_X86_32
 	TSKG(X86_TRAP_DF,		GDT_ENTRY_DOUBLEFAULT_TSS),
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index b63275db1db9..ccfcb07bfb2c 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -82,6 +82,44 @@ static void tdg_get_info(void)
 	td_info.attributes = out.rdx;
 }
 
+unsigned long tdg_get_ve_info(struct ve_info *ve)
+{
+	u64 ret;
+	struct tdcall_output out = {0};
+
+	/*
+	 * A #VE cannot nest before TDGETVEINFO is called; if one
+	 * were raised in that window, the TD would shut down. The
+	 * TDX module guarantees that no NMIs (or #MC or similar)
+	 * can happen in this window. After TDGETVEINFO the #VE
+	 * handler can nest if needed, although we don't expect
+	 * that to happen normally.
+	 */
+
+	ret = __tdcall(TDGETVEINFO, 0, 0, 0, 0, &out);
+
+	ve->exit_reason = out.rcx;
+	ve->exit_qual   = out.rdx;
+	ve->gla         = out.r8;
+	ve->gpa         = out.r9;
+	ve->instr_len   = out.r10 & UINT_MAX;
+	ve->instr_info  = out.r10 >> 32;
+
+	return ret;
+}
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+		struct ve_info *ve)
+{
+	/*
+	 * TODO: Add handler support for various #VE exit
+	 * reasons. It will be added by other patches in
+	 * the series.
+	 */
+	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+	return -EFAULT;
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 213d4aa8e337..64869aa88a5a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/vdso.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
@@ -1140,6 +1141,35 @@ DEFINE_IDTENTRY(exc_device_not_available)
 	}
 }
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+	struct ve_info ve;
+	int ret;
+
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
+	/*
+	 * Consume #VE info before re-enabling interrupts. They will be
+	 * re-enabled after executing the TDGETVEINFO TDCALL.
+	 */
+	ret = tdg_get_ve_info(&ve);
+
+	cond_local_irq_enable(regs);
+
+	if (!ret)
+		ret = tdg_handle_virtualization_exception(regs, &ve);
+	/*
+	 * If tdg_handle_virtualization_exception() could not process
+	 * it successfully, treat it as #GP(0) and handle it.
+	 */
+	if (ret)
+		do_general_protection(regs, 0);
+
+	cond_local_irq_disable(regs);
+}
+#endif
+
 #ifdef CONFIG_X86_32
 DEFINE_IDTENTRY_SW(iret_error)
 {
-- 
2.25.1



* [RFC v2 09/32] x86/tdx: Add HLT support for TDX guest
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (7 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 08/32] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls Kuppuswamy Sathyanarayanan
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Per Guest-Host-Communication Interface (GHCI) for Intel Trust
Domain Extensions (Intel TDX) specification, sec 3.8,
TDVMCALL[Instruction.HLT] provides HLT operation. Use it to implement
halt() and safe_halt() paravirtualization calls.

The same TDVMCALL is used to handle #VE exception due to
EXIT_REASON_HLT.
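
A simplified user-space model of the dispatch added by this patch:
handle the HLT exit reason, reject anything else, and on success advance
the instruction pointer past the instruction that raised the #VE. The
struct names and the halt_requests counter are hypothetical; in the
patch the HLT case issues the TDVMCALL via tdg_halt(). The value 12 for
EXIT_REASON_HLT is taken from the kernel's VMX exit-reason definitions.

```c
#include <assert.h>
#include <stdint.h>

#define EXIT_REASON_HLT 12

struct ve_info_model {
	uint64_t exit_reason;
	uint32_t instr_len;
};

struct regs_model {
	uint64_t ip;
};

static int halt_requests;	/* counts modelled HLT requests */

static int handle_ve_model(struct regs_model *regs,
			   const struct ve_info_model *ve)
{
	assert(regs && ve);

	switch (ve->exit_reason) {
	case EXIT_REASON_HLT:
		halt_requests++;	/* the patch calls tdg_halt() here */
		break;
	default:
		return -1;		/* unknown reason: caller falls back to #GP */
	}

	/* Skip the handled instruction so it is not executed again. */
	regs->ip += ve->instr_len;
	return 0;
}
```

Note that the instruction pointer is left untouched on the failure path,
so the fallback #GP handling sees the faulting instruction.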

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 44 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 37 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ccfcb07bfb2c..5169f72b6b3f 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -82,6 +82,27 @@ static void tdg_get_info(void)
 	td_info.attributes = out.rdx;
 }
 
+static __cpuidle void tdg_halt(void)
+{
+	u64 ret;
+
+	ret = __tdvmcall(EXIT_REASON_HLT, 0, 0, 0, 0, NULL);
+
+	/* It should never fail */
+	BUG_ON(ret);
+}
+
+static __cpuidle void tdg_safe_halt(void)
+{
+	/*
+	 * Enable interrupts next to the TDVMCALL to avoid
+	 * performance degradation.
+	 */
+	asm volatile("sti\n\t");
+
+	tdg_halt();
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -111,13 +132,19 @@ unsigned long tdg_get_ve_info(struct ve_info *ve)
 int tdg_handle_virtualization_exception(struct pt_regs *regs,
 		struct ve_info *ve)
 {
-	/*
-	 * TODO: Add handler support for various #VE exit
-	 * reasons. It will be added by other patches in
-	 * the series.
-	 */
-	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
-	return -EFAULT;
+	switch (ve->exit_reason) {
+	case EXIT_REASON_HLT:
+		tdg_halt();
+		break;
+	default:
+		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+		return -EFAULT;
+	}
+
+	/* After successful #VE handling, move the IP */
+	regs->ip += ve->instr_len;
+
+	return 0;
 }
 
 void __init tdx_early_init(void)
@@ -129,5 +156,8 @@ void __init tdx_early_init(void)
 
 	tdg_get_info();
 
+	pv_ops.irq.safe_halt = tdg_safe_halt;
+	pv_ops.irq.halt = tdg_halt;
+
 	pr_info("TDX guest is initialized\n");
 }
-- 
2.25.1



* [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (8 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 09/32] x86/tdx: Add HLT " Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 21:46   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 11/32] x86/tdx: Add MSR support for TDX guest Kuppuswamy Sathyanarayanan
                   ` (22 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

KVM hypercalls have to be wrapped into vendor-specific TDVMCALLs.
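
As an illustration of the vendor identifier, the sketch below packs an
ASCII string into a 64-bit value with the first character in the low
byte. That this is how TDVMCALL_VENDOR_KVM was derived is an inference
from the "TDX.KVM" comment next to the constant in this patch; the
helper pack_vendor_id() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack up to 8 ASCII characters, first character in the low byte. */
static uint64_t pack_vendor_id(const char *s)
{
	size_t len = strlen(s);
	uint64_t id = 0;

	assert(len <= 8);
	for (size_t i = 0; i < len; i++)
		id |= (uint64_t)(unsigned char)s[i] << (8 * i);

	return id;
}
```

Under that reading, pack_vendor_id("TDX.KVM") yields 0x4d564b2e584454,
the value loaded into R10 by __tdvmcall_vendor_kvm().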

[Isaku: proposed KVM VENDOR string]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/kvm_para.h | 21 +++++++++++++++
 arch/x86/include/asm/tdx.h      | 39 ++++++++++++++++++++++++++++
 arch/x86/kernel/tdcall.S        |  7 +++++
 arch/x86/kernel/tdx-kvm.c       | 45 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c           |  4 +++
 5 files changed, 116 insertions(+)
 create mode 100644 arch/x86/kernel/tdx-kvm.c

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
 #include <asm/alternative.h>
 #include <linux/interrupt.h>
 #include <uapi/asm/kvm_para.h>
+#include <asm/tdx.h>
 
 extern void kvmclock_init(void);
 
@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 static inline long kvm_hypercall0(unsigned int nr)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall0(nr);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall1(nr, p1);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
 				  unsigned long p2)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall2(nr, p1, p2);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
 				  unsigned long p2, unsigned long p3)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 				  unsigned long p4)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 1ca55d8e9963..e0b3ed9e262c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -56,6 +56,16 @@ u64 __tdcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 /* Helper function used to request services from VMM */
 u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
 	       struct tdvmcall_output *out);
+u64 __tdvmcall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+			  struct tdvmcall_output *out);
+
+long tdx_kvm_hypercall0(unsigned int nr);
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1);
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2);
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3);
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4);
 
 #else // !CONFIG_INTEL_TDX_GUEST
 
@@ -66,6 +76,35 @@ static inline bool is_tdx_guest(void)
 
 static inline void tdx_early_init(void) { };
 
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+				      unsigned long p2)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3,
+				      unsigned long p4)
+{
+	return -ENODEV;
+}
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index 81af70c2acbd..964bfd7fc682 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -11,6 +11,7 @@
  * refer to TDX GHCI specification).
  */
 #define TDVMCALL_EXPOSE_REGS_MASK	0xfc00
+#define TDVMCALL_VENDOR_KVM		0x4d564b2e584454 /* "TDX.KVM" */
 
 /*
  * TDX guests use the TDCALL instruction to make
@@ -198,3 +199,9 @@ SYM_FUNC_START(__tdvmcall)
 	call do_tdvmcall
 	retq
 SYM_FUNC_END(__tdvmcall)
+
+SYM_FUNC_START(__tdvmcall_vendor_kvm)
+	movq $TDVMCALL_VENDOR_KVM, %r10
+	call do_tdvmcall
+	retq
+SYM_FUNC_END(__tdvmcall_vendor_kvm)
diff --git a/arch/x86/kernel/tdx-kvm.c b/arch/x86/kernel/tdx-kvm.c
new file mode 100644
index 000000000000..c4264e926712
--- /dev/null
+++ b/arch/x86/kernel/tdx-kvm.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+
+static long tdvmcall_vendor(unsigned int fn, unsigned long r12,
+			    unsigned long r13, unsigned long r14,
+			    unsigned long r15)
+{
+	return __tdvmcall_vendor_kvm(fn, r12, r13, r14, r15, NULL);
+}
+
+/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return tdvmcall_vendor(nr, 0, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall0);
+
+/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return tdvmcall_vendor(nr, p1, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall1);
+
+/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2)
+{
+	return tdvmcall_vendor(nr, p1, p2, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall2);
+
+/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3)
+{
+	return tdvmcall_vendor(nr, p1, p2, p3, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall3);
+
+/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4)
+{
+	return tdvmcall_vendor(nr, p1, p2, p3, p4);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 5169f72b6b3f..721c213d807d 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -8,6 +8,10 @@
 
 #include <linux/cpu.h>
 
+#ifdef CONFIG_KVM_GUEST
+#include "tdx-kvm.c"
+#endif
+
 static struct {
 	unsigned int gpa_width;
 	unsigned long attributes;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 11/32] x86/tdx: Add MSR support for TDX guest
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (9 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 12/32] x86/tdx: Handle CPUID via #VE Kuppuswamy Sathyanarayanan
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Operations on context-switched MSRs can be run natively. The rest of
MSRs should be handled through TDVMCALLs.

TDVMCALL[Instruction.RDMSR] and TDVMCALL[Instruction.WRMSR] provide
MSR operations.

You can find RDMSR and WRMSR details in Guest-Host-Communication
Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX)
specification, sec 3.10, 3.11.
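
As a rough illustration of the value handling around these TDVMCALLs, here is
a minimal user-space sketch of how the 64-bit MSR value relates to the EDX:EAX
register pair. The helper names msr_join() and msr_split() are hypothetical
and are not part of this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: join the EDX:EAX halves of a WRMSR into one value. */
static uint64_t msr_join(uint32_t low, uint32_t high)
{
	return ((uint64_t)high << 32) | low;
}

/* Hypothetical helper: split a value returned for RDMSR back into EDX:EAX. */
static void msr_split(uint64_t val, uint32_t *low, uint32_t *high)
{
	assert(low && high);
	*low = (uint32_t)(val & UINT32_MAX);
	*high = (uint32_t)(val >> 32);
}
```

The write path in the patch builds the same 64-bit value from regs->ax and
regs->dx before issuing the TDVMCALL; the read path does the split.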

Also, since Intel CPUs do not use the CSTAR MSR for the SYSCALL
instruction, ignore accesses to the CSTAR MSR.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 85 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 83 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 721c213d807d..5b16707b3577 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -107,6 +107,73 @@ static __cpuidle void tdg_safe_halt(void)
 	tdg_halt();
 }
 
+static bool tdg_is_context_switched_msr(unsigned int msr)
+{
+	/*  XXX: Update the list of context-switched MSRs */
+
+	switch (msr) {
+	case MSR_EFER:
+	case MSR_IA32_CR_PAT:
+	case MSR_FS_BASE:
+	case MSR_GS_BASE:
+	case MSR_KERNEL_GS_BASE:
+	case MSR_IA32_SYSENTER_CS:
+	case MSR_IA32_SYSENTER_EIP:
+	case MSR_IA32_SYSENTER_ESP:
+	case MSR_STAR:
+	case MSR_LSTAR:
+	case MSR_SYSCALL_MASK:
+	case MSR_IA32_XSS:
+	case MSR_TSC_AUX:
+	case MSR_IA32_BNDCFGS:
+		return true;
+	}
+	return false;
+}
+
+static u64 tdg_read_msr_safe(unsigned int msr, int *err)
+{
+	u64 ret;
+	struct tdvmcall_output out = {0};
+
+	WARN_ON_ONCE(tdg_is_context_switched_msr(msr));
+
+	/*
+	 * Since Intel CPUs do not use the CSTAR MSR for the SYSCALL
+	 * instruction, just ignore it. Even raising a TDVMCALL
+	 * would lead to the same result.
+	 */
+	if (msr == MSR_CSTAR)
+		return 0;
+
+	ret = __tdvmcall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out);
+
+	*err = (ret) ? -EIO : 0;
+
+	return out.r11;
+}
+
+static int tdg_write_msr_safe(unsigned int msr, unsigned int low,
+			      unsigned int high)
+{
+	u64 ret;
+
+	WARN_ON_ONCE(tdg_is_context_switched_msr(msr));
+
+	/*
+	 * Since Intel CPUs do not use the CSTAR MSR for the SYSCALL
+	 * instruction, just ignore it. Even raising a TDVMCALL
+	 * would lead to the same result.
+	 */
+	if (msr == MSR_CSTAR)
+		return 0;
+
+	ret = __tdvmcall(EXIT_REASON_MSR_WRITE, msr, (u64)high << 32 | low,
+			 0, 0, NULL);
+
+	return ret ? -EIO : 0;
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -136,19 +203,33 @@ unsigned long tdg_get_ve_info(struct ve_info *ve)
 int tdg_handle_virtualization_exception(struct pt_regs *regs,
 		struct ve_info *ve)
 {
+	unsigned long val;
+	int ret = 0;
+
 	switch (ve->exit_reason) {
 	case EXIT_REASON_HLT:
 		tdg_halt();
 		break;
+	case EXIT_REASON_MSR_READ:
+		val = tdg_read_msr_safe(regs->cx, &ret);
+		if (!ret) {
+			regs->ax = val & UINT_MAX;
+			regs->dx = val >> 32;
+		}
+		break;
+	case EXIT_REASON_MSR_WRITE:
+		ret = tdg_write_msr_safe(regs->cx, regs->ax, regs->dx);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
 	}
 
 	/* After successful #VE handling, move the IP */
-	regs->ip += ve->instr_len;
+	if (!ret)
+		regs->ip += ve->instr_len;
 
-	return 0;
+	return ret;
 }
 
 void __init tdx_early_init(void)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 12/32] x86/tdx: Handle CPUID via #VE
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (10 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 11/32] x86/tdx: Add MSR support for TDX guest Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 13/32] x86/io: Allow to override inX() and outX() implementation Kuppuswamy Sathyanarayanan
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

TDX has three classes of CPUID leaves: some CPUID leaves
are always handled by the CPU, others are handled by the TDX module,
and some others are handled by the VMM. Since the VMM cannot directly
intercept the instruction, these are reflected to the guest with a #VE
exception; the guest then either converts it into a TDCALL to the VMM
or handles it directly.

The TDX module EAS has a full list of CPUID leaves which are handled
natively or by the TDX module in 16.2. Only unknown CPUIDs are handled by
the #VE method. In practice this typically only applies to the
hypervisor-specific CPUIDs unknown to the native CPU.

Therefore there is no risk of causing this in early CPUID code which
runs before the #VE handler is set up because it will never access
those exotic CPUID leaves.
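
To make the "hypervisor specific CPUIDs" case more concrete, here is a small
user-space sketch. It assumes the conventional 0x40000000-0x4fffffff range
that is reserved for hypervisor leaves; which leaves actually cause a #VE is
decided by the TDX module, not by this range check. The struct and function
names are hypothetical stand-ins, and unlike the handler in this patch the
sketch truncates the returned values to 32 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Range conventionally reserved for hypervisor CPUID leaves (assumption). */
#define HYPERVISOR_LEAF_FIRST	0x40000000u
#define HYPERVISOR_LEAF_LAST	0x4fffffffu

/* Hypothetical stand-ins for struct tdvmcall_output and the CPUID registers. */
struct vmcall_out { uint64_t r11, r12, r13, r14, r15; };
struct cpuid_regs { uint32_t eax, ebx, ecx, edx; };

static bool cpuid_leaf_is_hypervisor(uint32_t leaf)
{
	return leaf >= HYPERVISOR_LEAF_FIRST && leaf <= HYPERVISOR_LEAF_LAST;
}

/* Map the TDVMCALL output registers R12-R15 onto EAX, EBX, ECX and EDX. */
static void cpuid_from_vmcall(const struct vmcall_out *out,
			      struct cpuid_regs *regs)
{
	assert(out && regs);
	regs->eax = (uint32_t)out->r12;
	regs->ebx = (uint32_t)out->r13;
	regs->ecx = (uint32_t)out->r14;
	regs->edx = (uint32_t)out->r15;
}
```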

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 5b16707b3577..e42e260df245 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -174,6 +174,21 @@ static int tdg_write_msr_safe(unsigned int msr, unsigned int low,
 	return ret ? -EIO : 0;
 }
 
+static void tdg_handle_cpuid(struct pt_regs *regs)
+{
+	u64 ret;
+	struct tdvmcall_output out = {0};
+
+	ret = __tdvmcall(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, &out);
+
+	WARN_ON(ret);
+
+	regs->ax = out.r12;
+	regs->bx = out.r13;
+	regs->cx = out.r14;
+	regs->dx = out.r15;
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -220,6 +235,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_MSR_WRITE:
 		ret = tdg_write_msr_safe(regs->cx, regs->ax, regs->dx);
 		break;
+	case EXIT_REASON_CPUID:
+		tdg_handle_cpuid(regs);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 13/32] x86/io: Allow to override inX() and outX() implementation
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (11 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 12/32] x86/tdx: Handle CPUID via #VE Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 14/32] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

This patch allows overriding the implementation of the port I/O
helpers. TDX code will provide an implementation that redirects the
helpers to paravirt calls.
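
The override pattern can be sketched in user space as follows. This is only
an illustration: IO_OUT_HOOK, BUILD_OUT, record_out() and the demo_out*()
helpers are hypothetical names standing in for __out, BUILDIO and the real
port I/O helpers, and the "backend" merely records its arguments.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical backend: remember the last write instead of touching a port. */
static uint32_t last_value;
static int last_port = -1;

static void record_out(uint32_t value, int port)
{
	assert(port >= 0 && port <= 0xffff);
	last_value = value;
	last_port = port;
}

/* A header included earlier provides its own implementation of the hook... */
#define IO_OUT_HOOK(bwl, bw)	record_out(value, port)

/* ...so the generic default below is skipped and never compiled here. */
#ifndef IO_OUT_HOOK
#define IO_OUT_HOOK(bwl, bw)						\
	asm volatile("out" #bwl " %" #bw "0, %w1" : : "a"(value), "Nd"(port))
#endif

/* The helper generator only refers to the hook, whichever definition won. */
#define BUILD_OUT(bwl, bw, type)					\
static inline void demo_out##bwl(unsigned type value, int port)	\
{									\
	IO_OUT_HOOK(bwl, bw);						\
}

BUILD_OUT(b, b, char)
BUILD_OUT(w, w, short)
```

The hook body relies on the helper's local names (value, port), the same way
the macros in this patch do, so the generator itself needs no changes when
the hook is overridden.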

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/io.h | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d726459d08e5..ef7a686a55a9 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -271,18 +271,26 @@ static inline bool sev_key_active(void) { return false; }
 
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 
+#ifndef __out
+#define __out(bwl, bw)							\
+	asm volatile("out" #bwl " %" #bw "0, %w1" : : "a"(value), "Nd"(port))
+#endif
+
+#ifndef __in
+#define __in(bwl, bw)							\
+	asm volatile("in" #bwl " %w1, %" #bw "0" : "=a"(value) : "Nd"(port))
+#endif
+
 #define BUILDIO(bwl, bw, type)						\
 static inline void out##bwl(unsigned type value, int port)		\
 {									\
-	asm volatile("out" #bwl " %" #bw "0, %w1"			\
-		     : : "a"(value), "Nd"(port));			\
+	__out(bwl, bw);							\
 }									\
 									\
 static inline unsigned type in##bwl(int port)				\
 {									\
 	unsigned type value;						\
-	asm volatile("in" #bwl " %w1, %" #bw "0"			\
-		     : "=a"(value) : "Nd"(port));			\
+	__in(bwl, bw);							\
 	return value;							\
 }									\
 									\
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (12 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 13/32] x86/io: Allow to override inX() and outX() implementation Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-10 21:57   ` Dan Williams
  2021-04-26 18:01 ` [RFC v2 15/32] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
                   ` (18 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Unroll string operations and handle port I/O through TDVMCALLs.
Also handle #VE due to I/O operations with the same TDVMCALLs.
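
For the #VE path, the exit qualification has to be decoded before the
TDVMCALL can be issued. The following user-space sketch shows one way to do
that, assuming the VMX I/O-instruction exit qualification layout (bits 2:0
size minus one, bit 3 direction, bit 4 string, bits 31:16 port). The names
io_access, io_decode() and io_merge_in() are hypothetical. The merge helper
uses a mask that covers exactly 8 * size bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded form of an I/O-instruction exit qualification. */
struct io_access {
	bool string;		/* bit 4: string instruction (INS/OUTS) */
	bool in;		/* bit 3: 1 = IN, 0 = OUT */
	unsigned int size;	/* bits 2:0 hold size - 1 (1, 2 or 4 bytes) */
	uint16_t port;		/* bits 31:16: port number */
};

static struct io_access io_decode(uint32_t exit_qual)
{
	struct io_access io = {
		.string = (exit_qual & 0x10) != 0,
		.in = (exit_qual & 0x8) != 0,
		.size = (exit_qual & 0x7) + 1,
		.port = (uint16_t)(exit_qual >> 16),
	};

	return io;
}

/* Merge the low 'size' bytes read from the port into the saved RAX value. */
static uint64_t io_merge_in(uint64_t rax, uint32_t data, unsigned int size)
{
	uint64_t mask;

	assert(size == 1 || size == 2 || size == 4);
	mask = (1ull << (8 * size)) - 1;

	return (rax & ~mask) | (data & mask);
}
```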

Decompression code uses port IO for earlyprintk. We must use
paravirt calls there too if we want to allow earlyprintk.

Decompression code cannot deal with alternatives: use branches
instead to implement inX() and outX() helpers.

Since we use a call instruction in place of the in/out instruction,
the port argument has to be passed in a register; it cannot be an
immediate value as with the in/out instructions. So change the
constraint from "Nd" to "d".

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile |   1 +
 arch/x86/boot/compressed/tdcall.S |   9 ++
 arch/x86/include/asm/io.h         |   5 +-
 arch/x86/include/asm/tdx.h        |  46 ++++++++-
 arch/x86/kernel/tdcall.S          | 154 ++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c             |  33 +++++++
 6 files changed, 245 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdcall.S

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index a2554621cefe..a944a2038797 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -97,6 +97,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
new file mode 100644
index 000000000000..5ebb80d45ad8
--- /dev/null
+++ b/arch/x86/boot/compressed/tdcall.S
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <asm/export.h>
+
+/* Do not export symbols in decompression code */
+#undef EXPORT_SYMBOL
+#define EXPORT_SYMBOL(sym)
+
+#include "../../kernel/tdcall.S"
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index ef7a686a55a9..30a3b30395ad 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -43,6 +43,7 @@
 #include <asm/page.h>
 #include <asm/early_ioremap.h>
 #include <asm/pgtable_types.h>
+#include <asm/tdx.h>
 
 #define build_mmio_read(name, size, type, reg, barrier) \
 static inline type name(const volatile void __iomem *addr) \
@@ -309,7 +310,7 @@ static inline unsigned type in##bwl##_p(int port)			\
 									\
 static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 {									\
-	if (sev_key_active()) {						\
+	if (sev_key_active() || is_tdx_guest()) {			\
 		unsigned type *value = (unsigned type *)addr;		\
 		while (count) {						\
 			out##bwl(*value, port);				\
@@ -325,7 +326,7 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 									\
 static inline void ins##bwl(int port, void *addr, unsigned long count)	\
 {									\
-	if (sev_key_active()) {						\
+	if (sev_key_active() || is_tdx_guest()) {			\
 		unsigned type *value = (unsigned type *)addr;		\
 		while (count) {						\
 			*value = in##bwl(port);				\
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e0b3ed9e262c..b972c6531a53 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -5,6 +5,8 @@
 
 #define TDX_CPUID_LEAF_ID	0x21
 
+#ifndef __ASSEMBLY__
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
@@ -67,6 +69,48 @@ long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
 long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
 		unsigned long p3, unsigned long p4);
 
+/* Decompression code doesn't know how to handle alternatives */
+#ifdef BOOT_COMPRESSED_MISC_H
+#define __out(bwl, bw)							\
+do {									\
+	if (is_tdx_guest()) {						\
+		asm volatile("call tdg_out" #bwl : :			\
+				"a"(value), "d"(port));			\
+	} else {							\
+		asm volatile("out" #bwl " %" #bw "0, %w1" : :		\
+				"a"(value), "Nd"(port));		\
+	}								\
+} while (0)
+#define __in(bwl, bw)							\
+do {									\
+	if (is_tdx_guest()) {						\
+		asm volatile("call tdg_in" #bwl :			\
+				"=a"(value) : "d"(port));		\
+	} else {							\
+		asm volatile("in" #bwl " %w1, %" #bw "0" :		\
+				"=a"(value) : "Nd"(port));		\
+	}								\
+} while (0)
+#else
+#define __out(bwl, bw)							\
+	alternative_input("out" #bwl " %" #bw "1, %w2",			\
+			"call tdg_out" #bwl, X86_FEATURE_TDX_GUEST,	\
+			"a"(value), "d"(port))
+
+#define __in(bwl, bw)							\
+	alternative_io("in" #bwl " %w2, %" #bw "0",			\
+			"call tdg_in" #bwl, X86_FEATURE_TDX_GUEST,	\
+			"=a"(value), "d"(port))
+#endif
+
+void tdg_outb(unsigned char value, unsigned short port);
+void tdg_outw(unsigned short value, unsigned short port);
+void tdg_outl(unsigned int value, unsigned short port);
+
+unsigned char tdg_inb(unsigned short port);
+unsigned short tdg_inw(unsigned short port);
+unsigned int tdg_inl(unsigned short port);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
@@ -106,5 +150,5 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
 }
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
-
+#endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index 964bfd7fc682..df4159bb5103 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -3,6 +3,7 @@
 #include <asm/asm.h>
 #include <asm/frame.h>
 #include <asm/unwind_hints.h>
+#include <asm/export.h>
 
 #include <linux/linkage.h>
 
@@ -12,6 +13,12 @@
  */
 #define TDVMCALL_EXPOSE_REGS_MASK	0xfc00
 #define TDVMCALL_VENDOR_KVM		0x4d564b2e584454 /* "TDX.KVM" */
+#define EXIT_REASON_IO_INSTRUCTION	30
+/*
+ * Current size of struct tdvmcall_output is 40 bytes,
+ * but allocate double to account future changes.
+ */
+#define TDVMCALL_OUTPUT_SIZE		80
 
 /*
  * TDX guests use the TDCALL instruction to make
@@ -205,3 +212,150 @@ SYM_FUNC_START(__tdvmcall_vendor_kvm)
 	call do_tdvmcall
 	retq
 SYM_FUNC_END(__tdvmcall_vendor_kvm)
+
+.macro io_save_registers
+	push %rbp
+	push %rbx
+	push %rcx
+	push %rdx
+	push %rdi
+	push %rsi
+	push %r8
+	push %r9
+	push %r10
+	push %r11
+	push %r12
+	push %r13
+	push %r14
+	push %r15
+.endm
+.macro io_restore_registers
+	pop %r15
+	pop %r14
+	pop %r13
+	pop %r12
+	pop %r11
+	pop %r10
+	pop %r9
+	pop %r8
+	pop %rsi
+	pop %rdi
+	pop %rdx
+	pop %rcx
+	pop %rbx
+	pop %rbp
+.endm
+
+/*
+ * tdg_out{b,w,l}()  - Write given data to the specified port.
+ *
+ * @arg1 (RAX)       - Value to be written (passed via R8 to do_tdvmcall()).
+ * @arg2 (RDX)       - Port id (passed via RCX to do_tdvmcall()).
+ *
+ */
+SYM_FUNC_START(tdg_outb)
+	io_save_registers
+	xor %r8, %r8
+	/* Move data to R8 register */
+	mov %al, %r8b
+	/* Set data width to 1 byte */
+	mov $1, %rsi
+	jmp 1f
+
+SYM_FUNC_START(tdg_outw)
+	io_save_registers
+	xor %r8, %r8
+	/* Move data to R8 register */
+	mov %ax, %r8w
+	/* Set data width to 2 bytes */
+	mov $2, %rsi
+	jmp 1f
+
+SYM_FUNC_START(tdg_outl)
+	io_save_registers
+	xor %r8, %r8
+	/* Move data to R8 register */
+	mov %eax, %r8d
+	/* Set data width to 4 bytes */
+	mov $4, %rsi
+1:
+	/*
+	 * Since io_save_registers does not save rax
+	 * state, save it here so that we can preserve
+	 * the caller register state.
+	 */
+	push %rax
+
+	mov %rdx, %rcx
+	/* Set 1 in RDX to select out operation */
+	mov $1, %rdx
+	/* Set TDVMCALL function id in RDI */
+	mov $EXIT_REASON_IO_INSTRUCTION, %rdi
+	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+	xor %r10, %r10
+	/* Since we don't use tdvmcall output, set it to NULL */
+	xor %r9, %r9
+
+	call do_tdvmcall
+
+	pop %rax
+	io_restore_registers
+	ret
+SYM_FUNC_END(tdg_outb)
+SYM_FUNC_END(tdg_outw)
+SYM_FUNC_END(tdg_outl)
+EXPORT_SYMBOL(tdg_outb)
+EXPORT_SYMBOL(tdg_outw)
+EXPORT_SYMBOL(tdg_outl)
+
+/*
+ * tdg_in{b,w,l}()   - Read data from the specified port.
+ *
+ * @arg1 (RDX)       - Port id (passed via RCX to do_tdvmcall()).
+ *
+ * Returns data read via RAX register.
+ *
+ */
+SYM_FUNC_START(tdg_inb)
+	io_save_registers
+	/* Set data width to 1 byte */
+	mov $1, %rsi
+	jmp 1f
+
+SYM_FUNC_START(tdg_inw)
+	io_save_registers
+	/* Set data width to 2 bytes */
+	mov $2, %rsi
+	jmp 1f
+
+SYM_FUNC_START(tdg_inl)
+	io_save_registers
+	/* Set data width to 4 bytes */
+	mov $4, %rsi
+1:
+	mov %rdx, %rcx
+	/* Set 0 in RDX to select in operation */
+	mov $0, %rdx
+	/* Set TDVMCALL function id in RDI */
+	mov $EXIT_REASON_IO_INSTRUCTION, %rdi
+	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+	xor %r10, %r10
+	/* Allocate memory in stack for Output */
+	subq $TDVMCALL_OUTPUT_SIZE, %rsp
+	/* Move tdvmcall_output pointer to R9 */
+	movq %rsp, %r9
+
+	call do_tdvmcall
+
+	/* Move data read from port to RAX */
+	mov TDVMCALL_r11(%r9), %eax
+	/* Free allocated memory */
+	addq $TDVMCALL_OUTPUT_SIZE, %rsp
+	io_restore_registers
+	ret
+SYM_FUNC_END(tdg_inb)
+SYM_FUNC_END(tdg_inw)
+SYM_FUNC_END(tdg_inl)
+EXPORT_SYMBOL(tdg_inb)
+EXPORT_SYMBOL(tdg_inw)
+EXPORT_SYMBOL(tdg_inl)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e42e260df245..ec61f2f06c98 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -189,6 +189,36 @@ static void tdg_handle_cpuid(struct pt_regs *regs)
 	regs->dx = out.r15;
 }
 
+static void tdg_out(int size, int port, unsigned int value)
+{
+	tdvmcall(EXIT_REASON_IO_INSTRUCTION, size, 1, port, value);
+}
+
+static unsigned int tdg_in(int size, int port)
+{
+	return tdvmcall_out_r11(EXIT_REASON_IO_INSTRUCTION, size, 0, port, 0);
+}
+
+static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
+{
+	bool string = exit_qual & 16;
+	int out, size, port;
+
+	/* I/O string ops are unrolled at build time. */
+	BUG_ON(string);
+
+	out = (exit_qual & 8) ? 0 : 1;
+	size = (exit_qual & 7) + 1;
+	port = exit_qual >> 16;
+
+	if (out) {
+		tdg_out(size, port, regs->ax);
+	} else {
+		regs->ax &= ~GENMASK(8 * size - 1, 0);
+		regs->ax |= tdg_in(size, port) & GENMASK(8 * size - 1, 0);
+	}
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -238,6 +268,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_CPUID:
 		tdg_handle_cpuid(regs);
 		break;
+	case EXIT_REASON_IO_INSTRUCTION:
+		tdg_handle_io(regs, ve->exit_qual);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 15/32] x86/tdx: Handle in-kernel MMIO
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (13 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 14/32] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 21:52   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Kuppuswamy Sathyanarayanan
                   ` (17 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
exit reason.

For now we only handle the subset of instructions that the kernel uses
for MMIO operations. User-space access triggers SIGBUS.
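
The instruction decoding this relies on can be sketched in user space as
follows. The helper names modrm_reg_index() and mmio_access_size() are
hypothetical; the sketch assumes the usual x86 encoding, where ModRM bits 5:3
select the register and REX.R (bit 2 of the REX prefix) extends it to
r8-r15, and it treats the byte-sized MOV opcodes 0x88, 0x8A and 0xC6 as
1-byte accesses.

```c
#include <assert.h>
#include <stdint.h>

/* Register number (0-15) named by the ModRM.reg field plus REX.R. */
static unsigned int modrm_reg_index(uint8_t modrm, uint8_t rex)
{
	unsigned int regno = (modrm >> 3) & 0x7;

	/* rex is 0 when the instruction carries no REX prefix. */
	if (rex & 0x4)
		regno += 8;

	assert(regno < 16);
	return regno;
}

/* Access size in bytes: byte-sized MOV opcodes override the operand size. */
static unsigned int mmio_access_size(uint8_t opcode, unsigned int opnd_bytes)
{
	switch (opcode) {
	case 0x88:	/* MOV r/m8, r8 */
	case 0x8a:	/* MOV r8, r/m8 */
	case 0xc6:	/* MOV r/m8, imm8 */
		return 1;
	default:
		return opnd_bytes;
	}
}
```

For example, the bytes 44 89 08 (mov %r9d, (%rax)) carry REX = 0x44 and
ModRM = 0x08, so the register index is 9 and the access size is the 4-byte
operand size.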

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 100 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ec61f2f06c98..3fe617978fc4 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -5,6 +5,8 @@
 
 #include <asm/tdx.h>
 #include <asm/vmx.h>
+#include <asm/insn.h>
+#include <linux/sched/signal.h> /* force_sig_fault() */
 
 #include <linux/cpu.h>
 
@@ -219,6 +221,101 @@ static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
 	}
 }
 
+static unsigned long tdg_mmio(int size, bool write, unsigned long addr,
+		unsigned long val)
+{
+	return tdvmcall_out_r11(EXIT_REASON_EPT_VIOLATION, size,
+				write, addr, val);
+}
+
+static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn)
+{
+	static const int regoff[] = {
+		offsetof(struct pt_regs, ax),
+		offsetof(struct pt_regs, cx),
+		offsetof(struct pt_regs, dx),
+		offsetof(struct pt_regs, bx),
+		offsetof(struct pt_regs, sp),
+		offsetof(struct pt_regs, bp),
+		offsetof(struct pt_regs, si),
+		offsetof(struct pt_regs, di),
+		offsetof(struct pt_regs, r8),
+		offsetof(struct pt_regs, r9),
+		offsetof(struct pt_regs, r10),
+		offsetof(struct pt_regs, r11),
+		offsetof(struct pt_regs, r12),
+		offsetof(struct pt_regs, r13),
+		offsetof(struct pt_regs, r14),
+		offsetof(struct pt_regs, r15),
+	};
+	int regno;
+
+	regno = X86_MODRM_REG(insn->modrm.value);
+	if (X86_REX_R(insn->rex_prefix.value))
+		regno += 8;
+
+	return (void *)regs + regoff[regno];
+}
+
+static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+	int size;
+	bool write;
+	unsigned long *reg;
+	struct insn insn;
+	unsigned long val = 0;
+
+	/*
+	 * User mode would mean the kernel exposed a device directly
+	 * to ring3, which shouldn't happen except for things like
+	 * DPDK.
+	 */
+	if (user_mode(regs)) {
+		pr_err("Unexpected user-mode MMIO access.\n");
+		force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
+		return 0;
+	}
+
+	kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
+	insn_get_length(&insn);
+	insn_get_opcode(&insn);
+
+	write = ve->exit_qual & 0x2;
+
+	size = insn.opnd_bytes;
+	switch (insn.opcode.bytes[0]) {
+	/* MOV r/m8	r8	*/
+	case 0x88:
+	/* MOV r8	r/m8	*/
+	case 0x8A:
+	/* MOV r/m8	imm8	*/
+	case 0xC6:
+		size = 1;
+		break;
+	}
+
+	if (inat_has_immediate(insn.attr)) {
+		BUG_ON(!write);
+		val = insn.immediate.value;
+		tdg_mmio(size, write, ve->gpa, val);
+		return insn.length;
+	}
+
+	BUG_ON(!inat_has_modrm(insn.attr));
+
+	reg = get_reg_ptr(regs, &insn);
+
+	if (write) {
+		memcpy(&val, reg, size);
+		tdg_mmio(size, write, ve->gpa, val);
+	} else {
+		val = tdg_mmio(size, write, ve->gpa, val);
+		memset(reg, 0, size);
+		memcpy(reg, &val, size);
+	}
+	return insn.length;
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -271,6 +368,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_IO_INSTRUCTION:
 		tdg_handle_io(regs, ve->exit_qual);
 		break;
+	case EXIT_REASON_EPT_VIOLATION:
+		ve->instr_len = tdg_handle_mmio(regs, ve);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (14 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 15/32] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-11  1:23   ` Dan Williams
  2021-05-11 15:53   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 17/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure Kuppuswamy Sathyanarayanan
                   ` (16 subsequent siblings)
  32 siblings, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

When running as a TDX guest, there are a number of existing,
privileged instructions that do not work. If the guest kernel
uses these instructions, the hardware generates a #VE.

You can find the list of unsupported instructions in Intel
Trust Domain Extensions (Intel® TDX) Module specification,
sec 9.2.2 and in Guest-Host Communication Interface (GHCI)
Specification for Intel TDX, sec 2.4.1.
   
To prevent TD guests from using the MWAIT/MONITOR instructions,
support for these instructions is already disabled by the TDX
module (SEAM). So the CPUID flags for these instructions should
be in the disabled state.

If TD guests still execute these instructions despite the above
preventive measures, report it with appropriate warning messages in
the #VE handler. The WBINVD instruction is related to memory writeback
and cache flushes, so it is mainly used in the context of I/O devices.
Since TDX 1.0 does not support non-virtual I/O devices, skipping it
should not cause any fatal issues. But to let users know about its
usage, use WARN() to report it. Since the MWAIT/MONITOR instructions
are unsupported, use WARN() to report their usage as well.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 3fe617978fc4..294dda5bf3f6 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -371,6 +371,21 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_EPT_VIOLATION:
 		ve->instr_len = tdg_handle_mmio(regs, ve);
 		break;
+	case EXIT_REASON_WBINVD:
+		/*
+		 * WBINVD is not supported inside TDX guests. All in-
+		 * kernel uses should have been disabled.
+		 */
+		WARN_ONCE(1, "TD Guest used unsupported WBINVD instruction\n");
+		break;
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+		/*
+		 * Something in the kernel used MONITOR or MWAIT despite
+		 * X86_FEATURE_MWAIT being cleared for TDX guests.
+		 */
+		WARN_ONCE(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 17/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (15 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 18/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure Kuppuswamy Sathyanarayanan
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Erik Kaneda,
	Bob Moore, Rafael J . Wysocki

From: Erik Kaneda <erik.kaneda@intel.com>

ACPICA commit b9eb6f3a19b816824d6f47a6bc86fd8ce690e04b

Link: https://github.com/acpica/acpica/commit/b9eb6f3a
Signed-off-by: Erik Kaneda <erik.kaneda@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 include/acpi/actbl2.h | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index d6478c430c99..b2362600b9ff 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -516,7 +516,8 @@ enum acpi_madt_type {
 	ACPI_MADT_TYPE_GENERIC_MSI_FRAME = 13,
 	ACPI_MADT_TYPE_GENERIC_REDISTRIBUTOR = 14,
 	ACPI_MADT_TYPE_GENERIC_TRANSLATOR = 15,
-	ACPI_MADT_TYPE_RESERVED = 16	/* 16 and greater are reserved */
+	ACPI_MADT_TYPE_MULTIPROC_WAKEUP = 16,
+	ACPI_MADT_TYPE_RESERVED = 17	/* 17 and greater are reserved */
 };
 
 /*
@@ -723,6 +724,15 @@ struct acpi_madt_generic_translator {
 	u32 reserved2;
 };
 
+/* 16: Multiprocessor wakeup (ACPI 6.4) */
+
+struct acpi_madt_multiproc_wakeup {
+	struct acpi_subtable_header header;
+	u16 mailbox_version;
+	u32 reserved;		/* reserved - must be zero */
+	u64 base_address;
+};
+
 /*
  * Common flags fields for MADT subtables
  */
-- 
2.25.1



* [RFC v2 18/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (16 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 17/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 19/32] ACPI/table: Print MADT Wake table information Kuppuswamy Sathyanarayanan
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

ACPICA commit f1ee04207a212f6c519441e7e25397649ebc4cea

Add the Multiprocessor Wakeup Mailbox Structure definition. It is used
when parsing the MADT Wake table.
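
As a quick sanity check of the layout added below, here is a minimal
userspace sketch that mirrors the mailbox structure and verifies the
offsets with static assertions. The names mp_wakeup_mailbox and
mp_wakeup_mailbox_os_half_size() are hypothetical and used only for
illustration; the field order and the two size constants are taken
from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MB_OS_SIZE        2032	/* ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE */
#define MB_FIRMWARE_SIZE  2048	/* ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE */

/* Hypothetical userspace mirror of acpi_madt_multiproc_wakeup_mailbox. */
struct mp_wakeup_mailbox {
	uint16_t command;
	uint16_t reserved;		/* must be zero */
	uint32_t apic_id;
	uint64_t wakeup_vector;
	uint8_t  reserved_os[MB_OS_SIZE];
	uint8_t  reserved_firmware[MB_FIRMWARE_SIZE];
};

/* The fixed fields take 16 bytes, so the OS half ends at byte 2048. */
static_assert(offsetof(struct mp_wakeup_mailbox, apic_id) == 4,
	      "apic_id offset");
static_assert(offsetof(struct mp_wakeup_mailbox, wakeup_vector) == 8,
	      "wakeup_vector offset");
static_assert(offsetof(struct mp_wakeup_mailbox, reserved_os) == 16,
	      "OS area offset");
static_assert(offsetof(struct mp_wakeup_mailbox, reserved_firmware) == 2048,
	      "firmware area offset");
static_assert(sizeof(struct mp_wakeup_mailbox) == 4096,
	      "mailbox spans one 4 KiB page");

/* Size of the half of the mailbox that precedes the firmware area. */
size_t mp_wakeup_mailbox_os_half_size(void)
{
	return offsetof(struct mp_wakeup_mailbox, reserved_firmware);
}
```

The arithmetic explains the otherwise odd 2032: 16 bytes of fixed
fields plus 2032 bytes of OS area give 2048, and the firmware area
fills the remaining 2048 bytes of a 4096-byte mailbox.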

Link: https://github.com/acpica/acpica/commit/f1ee0420
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 include/acpi/actbl2.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index b2362600b9ff..7dce422f6119 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -733,6 +733,20 @@ struct acpi_madt_multiproc_wakeup {
 	u64 base_address;
 };
 
+#define ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE	2032
+#define ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE	2048
+
+struct acpi_madt_multiproc_wakeup_mailbox {
+	u16 command;
+	u16 reserved;		/* reserved - must be zero */
+	u32 apic_id;
+	u64 wakeup_vector;
+	u8 reserved_os[ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE];	/* reserved for OS use */
+	u8 reserved_firmware[ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE];	/* reserved for firmware use */
+};
+
+#define ACPI_MP_WAKE_COMMAND_WAKEUP    1
+
 /*
  * Common flags fields for MADT subtables
  */
-- 
2.25.1



* [RFC v2 19/32] ACPI/table: Print MADT Wake table information
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (17 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 18/32] ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 20/32] x86/acpi, x86/boot: Add multiprocessor wake-up support Kuppuswamy Sathyanarayanan
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan, Rafael J . Wysocki

When the MADT is parsed, print the MADT Wake table information as a
debug message. This is useful for debugging CPU boot issues
related to the MADT wake table.

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/acpi/tables.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 9d581045acff..206df4ad8b2b 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -207,6 +207,17 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
 		}
 		break;
 
+	case ACPI_MADT_TYPE_MULTIPROC_WAKEUP:
+		{
+			struct acpi_madt_multiproc_wakeup *p;
+
+			p = (struct acpi_madt_multiproc_wakeup *) header;
+
+			pr_debug("MP Wake (Mailbox version[%d] base_address[%llx])\n",
+				 p->mailbox_version, p->base_address);
+		}
+		break;
+
 	default:
 		pr_warn("Found unsupported MADT entry (type = 0x%x)\n",
 			header->type);
-- 
2.25.1



* [RFC v2 20/32] x86/acpi, x86/boot: Add multiprocessor wake-up support
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (18 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 19/32] ACPI/table: Print MADT Wake table information Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 21/32] x86/boot: Add a trampoline for APs booting in 64-bit mode Kuppuswamy Sathyanarayanan
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan, Sean Christopherson

As per ACPI specification r6.4, sec 5.2.12.19, a new sub-structure,
the multiprocessor wake-up structure, is added to the ACPI Multiple
APIC Description Table (MADT) to describe the mailbox. If the
platform firmware produces the multiprocessor wake-up structure,
the OS may use this mailbox-based mechanism to wake up the APs.

Add ACPI MADT wake table parsing support for x86. If the MADT wake
table is present, update apic->wakeup_secondary_cpu with a new
handler that uses the MADT wake mailbox to wake up the CPU.
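
The handshake described above can be sketched in userspace as follows.
This is an illustration under stated assumptions, not the kernel
implementation: mailbox_head, mailbox_wake_cpu(), firmware_poll_fn and
firmware_ack() are hypothetical names, volatile accesses stand in for
WRITE_ONCE()/READ_ONCE(), and the firmware side is simulated by a
callback. The poll loop uses an explicit bounded counter so that a
timeout is always reported as an error.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MP_WAKE_COMMAND_WAKEUP 1	/* ACPI_MP_WAKE_COMMAND_WAKEUP */

/* Hypothetical mirror of the fixed fields at the start of the mailbox. */
struct mailbox_head {
	volatile uint16_t command;
	uint16_t reserved;
	volatile uint32_t apic_id;
	volatile uint64_t wakeup_vector;
};

/* Stand-in for the firmware side, invoked once per poll iteration. */
typedef void (*firmware_poll_fn)(struct mailbox_head *mb);

int mailbox_wake_cpu(struct mailbox_head *mb, uint32_t apic_id,
		     uint64_t start_ip, unsigned int max_polls,
		     firmware_poll_fn poll)
{
	unsigned int i;

	/* Publish the arguments first and the command last. */
	mb->apic_id = apic_id;
	mb->wakeup_vector = start_ip;
	mb->command = MP_WAKE_COMMAND_WAKEUP;

	/* Firmware acknowledges by resetting the command field to zero. */
	for (i = 0; i < max_polls; i++) {
		if (poll)
			poll(mb);
		if (mb->command == 0)
			return 0;
	}

	return -EIO;	/* never acknowledged within max_polls */
}

/* Hypothetical firmware that acknowledges on the first poll. */
void firmware_ack(struct mailbox_head *mb)
{
	mb->command = 0;
}
```

Calling mailbox_wake_cpu(&mb, 3, 0x1000, 255, firmware_ack) returns 0,
while passing NULL as the firmware callback leaves the command set and
yields -EIO. Note that volatile only constrains the compiler; the
sketch relies on x86 not reordering stores with other stores.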

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/apic.h |  3 ++
 arch/x86/kernel/acpi/boot.c | 79 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/apic/apic.c |  8 ++++
 3 files changed, 90 insertions(+)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 412b51e059c8..3e94e1f402ea 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -487,6 +487,9 @@ static inline unsigned int read_apic_id(void)
 	return apic->get_apic_id(reg);
 }
 
+typedef int (*wakeup_cpu_handler)(int apicid, unsigned long start_eip);
+extern void acpi_wake_cpu_handler_update(wakeup_cpu_handler handler);
+
 extern int default_apic_id_valid(u32 apicid);
 extern int default_acpi_madt_oem_check(char *, char *);
 extern void default_setup_apic_routing(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 14cd3186dc77..fce2aa7d718f 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -65,6 +65,9 @@ int acpi_fix_pin2_polarity __initdata;
 static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
 #endif
 
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+static u64 acpi_mp_wake_mailbox_paddr;
+
 #ifdef CONFIG_X86_IO_APIC
 /*
  * Locks related to IOAPIC hotplug
@@ -329,6 +332,52 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
 	return 0;
 }
 
+static void acpi_mp_wake_mailbox_init(void)
+{
+	if (acpi_mp_wake_mailbox)
+		return;
+
+	acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+			sizeof(*acpi_mp_wake_mailbox), MEMREMAP_WB);
+}
+
+static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
+{
+	u8 timeout = 0xFF;
+
+	acpi_mp_wake_mailbox_init();
+
+	if (!acpi_mp_wake_mailbox)
+		return -EINVAL;
+
+	/*
+	 * Mailbox memory is shared between firmware and OS. Firmware will
+	 * listen on mailbox command address, and once it receives the wakeup
+	 * command, CPU associated with the given apicid will be booted. So,
+	 * the value of apic_id and wakeup_vector has to be set before updating
+	 * the wakeup command. So use WRITE_ONCE to let the compiler know about
+	 * it and preserve the order of writes.
+	 */
+	WRITE_ONCE(acpi_mp_wake_mailbox->apic_id, apicid);
+	WRITE_ONCE(acpi_mp_wake_mailbox->wakeup_vector, start_ip);
+	WRITE_ONCE(acpi_mp_wake_mailbox->command, ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+	/*
+	 * After writing the wakeup command, poll up to 0xFF times for the
+	 * firmware to reset the command field back to zero, which indicates
+	 * successful reception of the command.
+	 * NOTE: 255 as timeout value is decided based on our experiments.
+	 *
+	 * XXX: Change the timeout once ACPI specification comes up with
+	 *      standard maximum timeout value.
+	 */
+	while (READ_ONCE(acpi_mp_wake_mailbox->command) && timeout--)
+		cpu_relax();
+
+	/* The command is still set only if firmware never acknowledged it */
+	return READ_ONCE(acpi_mp_wake_mailbox->command) ? -EIO : 0;
+}
+
 #endif				/*CONFIG_X86_LOCAL_APIC */
 
 #ifdef CONFIG_X86_IO_APIC
@@ -1086,6 +1135,30 @@ static int __init acpi_parse_madt_lapic_entries(void)
 	}
 	return 0;
 }
+
+static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+				      const unsigned long end)
+{
+	struct acpi_madt_multiproc_wakeup *mp_wake;
+
+	if (acpi_mp_wake_mailbox)
+		return -EINVAL;
+
+	if (!IS_ENABLED(CONFIG_SMP))
+		return -ENODEV;
+
+	mp_wake = (struct acpi_madt_multiproc_wakeup *) header;
+	if (BAD_MADT_ENTRY(mp_wake, end))
+		return -EINVAL;
+
+	acpi_table_print_madt_entry(&header->common);
+
+	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+
+	acpi_wake_cpu_handler_update(acpi_wakeup_cpu);
+
+	return 0;
+}
 #endif				/* CONFIG_X86_LOCAL_APIC */
 
 #ifdef	CONFIG_X86_IO_APIC
@@ -1284,6 +1357,12 @@ static void __init acpi_process_madt(void)
 
 				smp_found_config = 1;
 			}
+
+			/*
+			 * Parse MADT MP Wake entry.
+			 */
+			acpi_table_parse_madt(ACPI_MADT_TYPE_MULTIPROC_WAKEUP,
+					      acpi_parse_mp_wake, 1);
 		}
 		if (error == -EINVAL) {
 			/*
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 4f26700f314d..f1b90a4b89e8 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2554,6 +2554,14 @@ u32 x86_msi_msg_get_destid(struct msi_msg *msg, bool extid)
 }
 EXPORT_SYMBOL_GPL(x86_msi_msg_get_destid);
 
+void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler)
+{
+	struct apic **drv;
+
+	for (drv = __apicdrivers; drv < __apicdrivers_end; drv++)
+		(*drv)->wakeup_secondary_cpu = handler;
+}
+
 /*
  * Override the generic EOI implementation with an optimized version.
  * Only called during early boot when only one CPU is active and with
-- 
2.25.1



* [RFC v2 21/32] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (19 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 20/32] x86/acpi, x86/boot: Add multiprocessor wake-up support Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-13  2:56   ` Dan Williams
  2021-04-26 18:01 ` [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot for TDX platforms Kuppuswamy Sathyanarayanan
                   ` (11 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kai Huang, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a trampoline for booting APs in 64-bit mode via a software handoff
with BIOS, and use the new trampoline for the ACPI MP wake protocol used
by TDX.

Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
mode.  For the GDT pointer, create a new entry as the existing storage
for the pointer occupies the zero entry in the GDT itself.
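
To make the "four bytes" concrete, the following sketch compares the
descriptor-table pointer operand used by LIDT/LGDT outside 64-bit mode
with the one used in 64-bit mode. The struct and function names are
hypothetical, and __attribute__((packed)) assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Operand of LIDT/LGDT outside 64-bit mode: 16-bit limit, 32-bit base. */
struct desc_ptr32 {
	uint16_t limit;
	uint32_t base;
} __attribute__((packed));

/* Operand of LIDT/LGDT in 64-bit mode: 16-bit limit, 64-bit base. */
struct desc_ptr64 {
	uint16_t limit;
	uint64_t base;
} __attribute__((packed));

static_assert(sizeof(struct desc_ptr32) == 6, "legacy pointer is 6 bytes");
static_assert(sizeof(struct desc_ptr64) == 10, "64-bit pointer is 10 bytes");

/*
 * Build a 64-bit pointer the way tr_gdt64 does: the limit is the table
 * size minus one, and the upper half of the base is zero because the
 * table sits at a 32-bit physical address.
 */
struct desc_ptr64 make_desc_ptr64(uint32_t table_pa, uint16_t table_bytes)
{
	struct desc_ptr64 p;

	p.limit = (uint16_t)(table_bytes - 1);
	p.base = table_pa;
	return p;
}
```

This matches the data added by the patch: tr_idt grows from a 6-byte
fill to a .short plus a .quad, and tr_gdt64 is a .short followed by
two .long values, 10 bytes in both cases.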

Reported-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/realmode.h          |  1 +
 arch/x86/kernel/smpboot.c                |  5 +++
 arch/x86/realmode/rm/header.S            |  1 +
 arch/x86/realmode/rm/trampoline_64.S     | 49 +++++++++++++++++++++++-
 arch/x86/realmode/rm/trampoline_common.S |  5 ++-
 5 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 5db5d083c873..5066c8b35e7c 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
 	u32	sev_es_trampoline_start;
 #endif
 #ifdef CONFIG_X86_64
+	u32	trampoline_start64;
 	u32	trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 16703c35a944..27d8491d753a 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1036,6 +1036,11 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
 	unsigned long boot_error = 0;
 	unsigned long timeout;
 
+#ifdef CONFIG_X86_64
+	if (is_tdx_guest())
+		start_ip = real_mode_header->trampoline_start64;
+#endif
+
 	idle->thread.sp = (unsigned long)task_pt_regs(idle);
 	early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
 	initial_code = (unsigned long)start_secondary;
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
 	.long	pa_sev_es_trampoline_start
 #endif
 #ifdef CONFIG_X86_64
+	.long	pa_trampoline_start64
 	.long	pa_trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 84c5d1b33d10..12b734b1da8b 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,20 @@ SYM_CODE_START(startup_32)
 	movl	%eax, %cr3
 
 	# Set up EFER
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	cmp	pa_tr_efer, %eax
+	jne	.Lwrite_efer
+	cmp	pa_tr_efer + 4, %edx
+	je	.Ldone_efer
+.Lwrite_efer:
 	movl	pa_tr_efer, %eax
 	movl	pa_tr_efer + 4, %edx
-	movl	$MSR_EFER, %ecx
 	wrmsr
 
+.Ldone_efer:
 	# Enable paging and in turn activate Long Mode
-	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	/*
@@ -161,6 +168,19 @@ SYM_CODE_START(startup_32)
 	ljmpl	$__KERNEL_CS, $pa_startup_64
 SYM_CODE_END(startup_32)
 
+SYM_CODE_START(pa_trampoline_compat)
+	/*
+	 * In compatibility mode.  Prep ESP and DX for startup_32, then disable
+	 * paging and complete the switch to legacy 32-bit mode.
+	 */
+	movl	$rm_stack_end, %esp
+	movw	$__KERNEL_DS, %dx
+
+	movl	$(X86_CR0_NE | X86_CR0_PE), %eax
+	movl	%eax, %cr0
+	ljmpl   $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
 	.section ".text64","ax"
 	.code64
 	.balign 4
@@ -169,6 +189,20 @@ SYM_CODE_START(startup_64)
 	jmpq	*tr_start(%rip)
 SYM_CODE_END(startup_64)
 
+SYM_CODE_START(trampoline_start64)
+	/*
+	 * APs start here on a direct transfer from 64-bit BIOS with identity
+	 * mapped page tables.  Load the kernel's GDT in order to gear down to
+	 * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+	 * segment registers.  Load the zero IDT so any fault triggers a
+	 * shutdown instead of jumping back into BIOS.
+	 */
+	lidt	tr_idt(%rip)
+	lgdt	tr_gdt64(%rip)
+
+	ljmpl	*tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
 	.section ".rodata","a"
 	# Duplicate the global descriptor table
 	# so the kernel can live anywhere
@@ -182,6 +216,17 @@ SYM_DATA_START(tr_gdt)
 	.quad	0x00cf93000000ffff	# __KERNEL_DS
 SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
 
+SYM_DATA_START(tr_gdt64)
+	.short	tr_gdt_end - tr_gdt - 1	# gdt limit
+	.long	pa_tr_gdt
+	.long	0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+	.long	pa_trampoline_compat
+	.short	__KERNEL32_CS
+SYM_DATA_END(tr_compat)
+
 	.bss
 	.balign	PAGE_SIZE
 SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 5033e640f957..506d5897112a 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -1,4 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 	.section ".rodata","a"
 	.balign	16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+SYM_DATA_START_LOCAL(tr_idt)
+	.short	0
+	.quad	0
+SYM_DATA_END(tr_idt)
-- 
2.25.1



* [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot for TDX platforms
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (20 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 21/32] x86/boot: Add a trampoline for APs booting in 64-bit mode Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-13  3:03   ` Dan Williams
  2021-04-26 18:01 ` [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process Kuppuswamy Sathyanarayanan
                   ` (10 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Avoid operations that make the TDX module inject a #VE during
compressed boot, which is fatal for TDX platforms.

Details:

 1. The TDX module injects a #VE if a TDX guest attempts to write
    EFER. So skip the WRMSR to set EFER.LME=1 if it's already
    set. TDX also forces EFER.LME=1, i.e. the branch will always
    be taken and thus the #VE avoided.

 2. The TDX module also injects a #VE if the guest attempts to clear
    CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed
    boot. Setting CR0.NE should be a nop on all CPUs that
    support 64-bit mode.
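
The two points above can be modelled in C as below. This is only a
sketch of the logic, with hypothetical function names; the bit
positions (EFER.LME = bit 8, CR0.PE = bit 0, CR0.NE = bit 5,
CR0.PG = bit 31) are the architectural ones.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFER_LME_BIT	8
#define CR0_PE		(1u << 0)
#define CR0_NE		(1u << 5)
#define CR0_PG		(1u << 31)

/* Like BTS: set the bit and return its previous state (the carry flag). */
bool test_and_set_bit32(uint32_t *val, unsigned int bit)
{
	bool old = (*val >> bit) & 1u;

	*val |= 1u << bit;
	return old;
}

/*
 * Models "btsl $_EFER_LME, %eax; jc 1f; wrmsr": the WRMSR is needed
 * only when LME was previously clear.
 */
bool efer_lme_needs_write(uint32_t *efer_lo)
{
	return !test_and_set_bit32(efer_lo, EFER_LME_BIT);
}

/* CR0 value loaded when paging is re-enabled, now including NE. */
uint32_t cr0_paging_value(void)
{
	return CR0_PG | CR0_NE | CR0_PE;
}
```

With EFER already holding LME (for example 0x500, LME plus LMA), the
helper reports that no write is needed and leaves the value untouched,
which is the path a TDX guest always takes.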

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S | 5 +++--
 arch/x86/boot/compressed/pgtable.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..37c2f37d4a0d 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,8 +616,9 @@ SYM_CODE_START(trampoline_32bit_src)
 	movl	$MSR_EFER, %ecx
 	rdmsr
 	btsl	$_EFER_LME, %eax
+	jc	1f
 	wrmsr
-	popl	%edx
+1:	popl	%edx
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
@@ -636,7 +637,7 @@ SYM_CODE_START(trampoline_32bit_src)
 	pushl	%eax
 
 	/* Enable paging again */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
+	movl	$(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	lret
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE	0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
 
 #define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
 
-- 
2.25.1



* [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (21 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot for TDX platforms Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-13  3:23   ` Dan Williams
  2021-04-26 18:01 ` [RFC v2 24/32] x86/topology: Disable CPU online/offline control for TDX guest Kuppuswamy Sathyanarayanan
                   ` (9 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Skip writing EFER during secondary_startup_64() if the current value is
also the desired value. This avoids a #VE when running as a TDX guest,
as the TDX-Module does not allow writes to EFER (even when writing the
current, fixed value).

Also, preserve CR4.MCE instead of clearing it during boot to avoid a #VE
when running as a TDX guest. The TDX-Module (effectively part of the
hypervisor) requires CR4.MCE to be set at all times and injects a #VE
if the guest attempts to clear CR4.MCE.
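
A small model of the CR4 computation may help when reviewing the
assembly. The function name boot_cr4() is hypothetical; the bit
positions (PAE = bit 5, MCE = bit 6, PGE = bit 7, LA57 = bit 12) are
architectural. The sketch follows the arch/x86/kernel/head_64.S
variant; the compressed-boot variant ORs in PAE only.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_PAE		(1u << 5)
#define CR4_MCE		(1u << 6)
#define CR4_PGE		(1u << 7)
#define CR4_LA57	(1u << 12)

/*
 * Keep only the current MCE bit, drop every other bit of the old
 * value, then OR in the bits the boot path wants to enable.
 */
uint32_t boot_cr4(uint32_t current_cr4, bool la57)
{
	uint32_t val = current_cr4 & CR4_MCE;

	val |= CR4_PAE | CR4_PGE;
	if (la57)
		val |= CR4_LA57;
	return val;
}
```

If MCE is set on entry, as it is for a TDX guest, it stays set in the
value written back; if it is clear, it stays clear, so behaviour on
other platforms is unchanged.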

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S |  5 ++++-
 arch/x86/kernel/head_64.S          | 13 +++++++++++--
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 37c2f37d4a0d..2d79e5f97360 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -622,7 +622,10 @@ SYM_CODE_START(trampoline_32bit_src)
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
-	movl	$X86_CR4_PAE, %eax
+	movl	%cr4, %eax
+	/* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
+	andl	$X86_CR4_MCE, %eax
+	orl	$X86_CR4_PAE, %eax
 	testl	%edx, %edx
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..92c77cf75542 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 1:
 
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	movq	%cr4, %rcx
+	/* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
+	andl	$X86_CR4_MCE, %ecx
+	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
@@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	/* Setup EFER (Extended Feature Enable Register) */
 	movl	$MSR_EFER, %ecx
 	rdmsr
+	movl    %eax, %edx
 	btsl	$_EFER_SCE, %eax	/* Enable System Call */
 	btl	$20,%edi		/* No Execute supported? */
 	jnc     1f
 	btsl	$_EFER_NX, %eax
 	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1:	wrmsr				/* Make changes effective */
 
+	/* Skip the WRMSR if the current value matches the desired value. */
+1:	cmpl	%edx, %eax
+	je	1f
+	xor	%edx, %edx
+	wrmsr				/* Make changes effective */
+1:
 	/* Setup cr0 */
 	movl	$CR0_STATE, %eax
 	/* Make changes effective */
-- 
2.25.1



* [RFC v2 24/32] x86/topology: Disable CPU online/offline control for TDX guest
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (22 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 25/32] x86/tdx: Forcefully disable legacy PIC for TDX guests Kuppuswamy Sathyanarayanan
                   ` (8 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

As per the Intel TDX Virtual Firmware Design Guide, sec 4.3.5 and
sec 9.4, TDVF puts all unused CPUs in a spinning state until the
OS requests CPU bring-up via the mailbox address passed in the
ACPI MADT table. Since all unused CPUs are in a spinning state by
default, there is no point in supporting dynamic CPU
online/offline. The current generation of TDVF therefore does not
support CPU hotplug. It may be supported in a future generation.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/tdx.c      | 14 ++++++++++++++
 arch/x86/kernel/topology.c |  3 ++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 294dda5bf3f6..ab1efa4d10e9 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -316,6 +316,17 @@ static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 	return insn.length;
 }
 
+static int tdg_cpu_offline_prepare(unsigned int cpu)
+{
+	/*
+	 * Per Intel TDX Virtual Firmware Design Guide,
+	 * sec 4.3.5 and sec 9.4, hotplug is not supported
+	 * on TDX platforms. So refuse to take a CPU
+	 * offline once it has been brought online.
+	 */
+	return -EOPNOTSUPP;
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -410,5 +421,8 @@ void __init tdx_early_init(void)
 	pv_ops.irq.safe_halt = tdg_safe_halt;
 	pv_ops.irq.halt = tdg_halt;
 
+	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdg:cpu_hotplug",
+			  NULL, tdg_cpu_offline_prepare);
+
 	pr_info("TDX guest is initialized\n");
 }
diff --git a/arch/x86/kernel/topology.c b/arch/x86/kernel/topology.c
index f5477eab5692..d879ea96d79c 100644
--- a/arch/x86/kernel/topology.c
+++ b/arch/x86/kernel/topology.c
@@ -34,6 +34,7 @@
 #include <linux/irq.h>
 #include <asm/io_apic.h>
 #include <asm/cpu.h>
+#include <asm/tdx.h>
 
 static DEFINE_PER_CPU(struct x86_cpu, cpu_devices);
 
@@ -130,7 +131,7 @@ int arch_register_cpu(int num)
 			}
 		}
 	}
-	if (num || cpu0_hotpluggable)
+	if ((num || cpu0_hotpluggable) && !is_tdx_guest())
 		per_cpu(cpu_devices, num).cpu.hotpluggable = 1;
 
 	return register_cpu(&per_cpu(cpu_devices, num).cpu, num);
-- 
2.25.1



* [RFC v2 25/32] x86/tdx: Forcefully disable legacy PIC for TDX guests
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (23 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 24/32] x86/topology: Disable CPU online/offline control for TDX guest Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code Kuppuswamy Sathyanarayanan
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Disable the legacy PIC (8259) for TDX guests as the PIC cannot be
supported by the VMM. TDX Module does not allow direct IRQ injection,
and using posted interrupt style delivery requires the guest to EOI
the IRQ, which diverges from the legacy PIC behavior.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ab1efa4d10e9..1f1bb98e1d38 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -4,6 +4,7 @@
 #define pr_fmt(fmt) "TDX: " fmt
 
 #include <asm/tdx.h>
+#include <asm/i8259.h>
 #include <asm/vmx.h>
 #include <asm/insn.h>
 #include <linux/sched/signal.h> /* force_sig_fault() */
@@ -421,6 +422,8 @@ void __init tdx_early_init(void)
 	pv_ops.irq.safe_halt = tdg_safe_halt;
 	pv_ops.irq.halt = tdg_halt;
 
+	legacy_pic = &null_legacy_pic;
+
 	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdg:cpu_hotplug",
 			  NULL, tdg_cpu_offline_prepare);
 
-- 
2.25.1



* [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (24 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 25/32] x86/tdx: Forcefully disable legacy PIC for TDX guests Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 21:54   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Kuppuswamy Sathyanarayanan
                   ` (6 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Intel TDX doesn't allow the VMM to access guest memory. Any memory
that is required for communication with the VMM must be shared
explicitly by setting the shared bit in the page table entry. After
setting the shared bit, the conversion must be completed with the
MapGPA TDVMCALL. The call informs the VMM about the conversion and
makes it remove the GPA from the S-EPT mapping. Shared memory is
similar to unencrypted memory in AMD SME/SEV terminology, but the
underlying process of sharing/un-sharing the memory is different
for the Intel TDX guest platform.

SEV assumes that I/O devices can only do DMA to "decrypted"
physical addresses without the C-bit set.  In order for the CPU
to interact with this memory, the CPU needs a decrypted mapping.
To support this, the AMD SME code makes force_dma_unencrypted()
return true on platforms that support the AMD SEV feature. The DMA
memory allocation API uses it to trigger set_memory_decrypted()
on those platforms.

TDX is similar.  TDX architecturally prevents access to private
guest memory by anything other than the guest itself. This means that
any DMA buffers must be shared.

So move force_dma_unencrypted() out of the AMD-specific code.

It will be modified to return true for the Intel TDX guest platform,
similar to the AMD SEV feature.

Introduce a new config option, X86_MEM_ENCRYPT_COMMON, that has to be
selected by all x86 memory encryption features. It will be
selected by both the AMD SEV and Intel TDX guest config options.

This is preparation for the TDX changes in the DMA code and has no
functional change.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/Kconfig                 |  8 +++++--
 arch/x86/mm/Makefile             |  2 ++
 arch/x86/mm/mem_encrypt.c        | 30 -------------------------
 arch/x86/mm/mem_encrypt_common.c | 38 ++++++++++++++++++++++++++++++++
 4 files changed, 46 insertions(+), 32 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 932e6d759ba7..67f99bf27729 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1529,14 +1529,18 @@ config X86_CPA_STATISTICS
 	  helps to determine the effectiveness of preserving large and huge
 	  page mappings when mapping protections are changed.
 
+config X86_MEM_ENCRYPT_COMMON
+	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select DYNAMIC_PHYSICAL_MASK
+	def_bool n
+
 config AMD_MEM_ENCRYPT
 	bool "AMD Secure Memory Encryption (SME) support"
 	depends on X86_64 && CPU_SUP_AMD
 	select DMA_COHERENT_POOL
-	select DYNAMIC_PHYSICAL_MASK
 	select ARCH_USE_MEMREMAP_PROT
-	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
 	select INSTRUCTION_DECODER
+	select X86_MEM_ENCRYPT_COMMON
 	help
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..b31cb52bf1bd 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,8 @@ obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
 obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 
+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON)	+= mem_encrypt_common.o
+
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index ae78cef79980..6f713c6a32b2 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -15,10 +15,6 @@
 #include <linux/dma-direct.h>
 #include <linux/swiotlb.h>
 #include <linux/mem_encrypt.h>
-#include <linux/device.h>
-#include <linux/kernel.h>
-#include <linux/bitops.h>
-#include <linux/dma-mapping.h>
 
 #include <asm/tlbflush.h>
 #include <asm/fixmap.h>
@@ -390,32 +386,6 @@ bool noinstr sev_es_active(void)
 	return sev_status & MSR_AMD64_SEV_ES_ENABLED;
 }
 
-/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
-bool force_dma_unencrypted(struct device *dev)
-{
-	/*
-	 * For SEV, all DMA must be to unencrypted addresses.
-	 */
-	if (sev_active())
-		return true;
-
-	/*
-	 * For SME, all DMA must be to unencrypted addresses if the
-	 * device does not support DMA to addresses that include the
-	 * encryption mask.
-	 */
-	if (sme_active()) {
-		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
-		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
-						dev->bus_dma_limit);
-
-		if (dma_dev_mask <= dma_enc_mask)
-			return true;
-	}
-
-	return false;
-}
-
 void __init mem_encrypt_free_decrypted_mem(void)
 {
 	unsigned long vaddr, vaddr_end, npages;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..964e04152417
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2016 Advanced Micro Devices, Inc.
+ *
+ * Author: Tom Lendacky <thomas.lendacky@amd.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/mem_encrypt.h>
+#include <linux/dma-mapping.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+	/*
+	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 */
+	if (sev_active())
+		return true;
+
+	/*
+	 * For SME, all DMA must be to unencrypted addresses if the
+	 * device does not support DMA to addresses that include the
+	 * encryption mask.
+	 */
+	if (sme_active()) {
+		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
+		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
+						dev->bus_dma_limit);
+
+		if (dma_dev_mask <= dma_enc_mask)
+			return true;
+	}
+
+	return false;
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (25 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-19  5:00   ` Kuppuswamy, Sathyanarayanan
  2021-05-19 16:14   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 28/32] x86/tdx: Make pages shared in ioremap() Kuppuswamy Sathyanarayanan
                   ` (5 subsequent siblings)
  32 siblings, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

tdg_shared_mask() returns the mask that has to be set in a page
table entry to make a page shared with the VMM.

Also note that the shared mapping configuration cannot be combined
between the AMD SME and Intel TDX guest platforms in a common function.
SME has to do it very early in __startup_64() because it sets the bit
on all memory except what is used for communication. TDX can postpone
it because no shared mapping is needed in very early boot.
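
As a rough illustration of the arithmetic in this patch (the top bit of
the guest physical address width is the shared bit, and it is removed
from the physical mask), here is a user-space sketch. The sketch_*
names are hypothetical, and the assert on gpa_width is an assumption of
the sketch rather than something the patch checks.

```c
#include <assert.h>
#include <stdint.h>

/* The highest bit of a gpa_width-bit guest physical address. */
static uint64_t sketch_shared_mask(unsigned int gpa_width)
{
	/* Sketch-only precondition: avoid shifting by -1 or by 64+. */
	assert(gpa_width >= 1 && gpa_width <= 64);
	return 1ULL << (gpa_width - 1);
}

/* Drop the shared bit from a physical address mask. */
static uint64_t sketch_exclude_shared_bit(uint64_t physical_mask,
					  unsigned int gpa_width)
{
	return physical_mask & ~sketch_shared_mask(gpa_width);
}
```

For example, with an assumed gpa_width of 52 and a 52-bit physical
mask, the result keeps only bits 0 through 50.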

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/Kconfig           | 1 +
 arch/x86/include/asm/tdx.h | 6 ++++++
 arch/x86/kernel/tdx.c      | 9 +++++++++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67f99bf27729..5f92e8205de2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -882,6 +882,7 @@ config INTEL_TDX_GUEST
 	select PARAVIRT_XL
 	select X86_X2APIC
 	select SECURITY_LOCKDOWN_LSM
+	select X86_MEM_ENCRYPT_COMMON
 	help
 	  Provide support for running in a trusted domain on Intel processors
 	  equipped with Trusted Domain eXtenstions. TDX is an new Intel
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b972c6531a53..dc80cf7f7d08 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -111,6 +111,8 @@ unsigned char tdg_inb(unsigned short port);
 unsigned short tdg_inw(unsigned short port);
 unsigned int tdg_inl(unsigned short port);
 
+extern phys_addr_t tdg_shared_mask(void);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
@@ -149,6 +151,10 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
 	return -ENODEV;
 }
 
+static inline phys_addr_t tdg_shared_mask(void)
+{
+	return 0;
+}
 #endif /* CONFIG_INTEL_TDX_GUEST */
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 1f1bb98e1d38..7e391cd7aa2b 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -76,6 +76,12 @@ bool is_tdx_guest(void)
 }
 EXPORT_SYMBOL_GPL(is_tdx_guest);
 
+/* The highest bit of a guest physical address is the "sharing" bit */
+phys_addr_t tdg_shared_mask(void)
+{
+	return 1ULL << (td_info.gpa_width - 1);
+}
+
 static void tdg_get_info(void)
 {
 	u64 ret;
@@ -87,6 +93,9 @@ static void tdg_get_info(void)
 
 	td_info.gpa_width = out.rcx & GENMASK(5, 0);
 	td_info.attributes = out.rdx;
+
+	/* Exclude Shared bit from the __PHYSICAL_MASK */
+	physical_mask &= ~tdg_shared_mask();
 }
 
 static __cpuidle void tdg_halt(void)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (26 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 21:55   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 29/32] x86/tdx: Add helper to do MapGPA TDVMALL Kuppuswamy Sathyanarayanan
                   ` (4 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

All ioremap()ed pages that are not backed by normal memory (NONE or
RESERVED) have to be mapped as shared.

Reuse the infrastructure we have for AMD SEV.

Note that the DMA code doesn't use ioremap() to convert memory to
shared, because DMA buffers are backed by normal memory. The DMA code
makes buffers shared with set_memory_decrypted().
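
The protection choice made in __ioremap_caller() by this patch can be
sketched in user space as below. This is only a model: page protections
are plain integers, and the function name and parameters are
hypothetical. It assumes, as a sketch-only precondition, that the
encryption mask and the shared mask never overlap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Pick the final protection bits for an ioremap()ed range:
 * an encrypted mapping gets the encryption mask, otherwise a TDX guest
 * gets the shared bit, otherwise the bits are left alone.
 */
static uint64_t sketch_ioremap_prot(uint64_t prot, bool map_encrypted,
				    bool tdx_guest, uint64_t enc_mask,
				    uint64_t shared_mask)
{
	assert((enc_mask & shared_mask) == 0);

	if (map_encrypted)
		return prot | enc_mask;
	if (tdx_guest)
		return prot | shared_mask;
	return prot;
}
```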

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h | 3 +++
 arch/x86/mm/ioremap.c          | 8 +++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..734e775605c0 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -21,6 +21,9 @@
 #define pgprot_encrypted(prot)	__pgprot(__sme_set(pgprot_val(prot)))
 #define pgprot_decrypted(prot)	__pgprot(__sme_clr(pgprot_val(prot)))
 
+/* Make the page accessible to the VMM */
+#define pgprot_tdg_shared(prot) __pgprot(pgprot_val(prot) | tdg_shared_mask())
+
 #ifndef __ASSEMBLY__
 #include <asm/x86_init.h>
 #include <asm/fpu/xstate.h>
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 9e5ccc56f8e0..c0dac02f5b3f 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -87,12 +87,12 @@ static unsigned int __ioremap_check_ram(struct resource *res)
 }
 
 /*
- * In a SEV guest, NONE and RESERVED should not be mapped encrypted because
- * there the whole memory is already encrypted.
+ * In a SEV or TDX guest, NONE and RESERVED should not be mapped encrypted (or
+ * private in TDX case) because there the whole memory is already encrypted.
  */
 static unsigned int __ioremap_check_encrypted(struct resource *res)
 {
-	if (!sev_active())
+	if (!sev_active() && !is_tdx_guest())
 		return 0;
 
 	switch (res->desc) {
@@ -244,6 +244,8 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
 	prot = PAGE_KERNEL_IO;
 	if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
 		prot = pgprot_encrypted(prot);
+	else if (is_tdx_guest())
+		prot = pgprot_tdg_shared(prot);
 
 	switch (pcm) {
 	case _PAGE_CACHE_MODE_UC:
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 29/32] x86/tdx: Add helper to do MapGPA TDVMALL
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (27 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 28/32] x86/tdx: Make pages shared in ioremap() Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-19 15:59   ` Dave Hansen
  2021-04-26 18:01 ` [RFC v2 30/32] x86/tdx: Make DMA pages shared Kuppuswamy Sathyanarayanan
                   ` (3 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

MapGPA TDVMCALL requests the host VMM to map a GPA range as private or
shared memory. Shared GPA mappings can be used for communication
between the TD guest and the host VMM, for example for paravirtualized
I/O.

The new helper tdg_map_gpa() provides access to the operation.
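
A user-space sketch of how the helper in this patch forms the two
MapGPA arguments (start address and size) is shown below. The struct
and function names are hypothetical, the 4k page size is an assumption
of the sketch, and no TDVMCALL is issued; it only models the address
and size computation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL

struct sketch_map_gpa_args {
	uint64_t gpa;	/* start address, with the shared bit if requested */
	uint64_t size;	/* length of the range in bytes */
};

static struct sketch_map_gpa_args
sketch_build_map_gpa_args(uint64_t gpa, unsigned int numpages, bool shared,
			  unsigned int gpa_width)
{
	struct sketch_map_gpa_args args;

	/* Sketch-only precondition on the address width. */
	assert(gpa_width >= 1 && gpa_width <= 64);

	/* A shared request carries the shared bit in the address itself. */
	if (shared)
		gpa |= 1ULL << (gpa_width - 1);

	args.gpa = gpa;
	args.size = SKETCH_PAGE_SIZE * numpages;
	return args;
}
```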

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/tdx.h | 13 +++++++++++++
 arch/x86/kernel/tdx.c      | 13 +++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index dc80cf7f7d08..4789798d7737 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -7,6 +7,11 @@
 
 #ifndef __ASSEMBLY__
 
+enum tdx_map_type {
+	TDX_MAP_PRIVATE,
+	TDX_MAP_SHARED,
+};
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
@@ -112,6 +117,8 @@ unsigned short tdg_inw(unsigned short port);
 unsigned int tdg_inl(unsigned short port);
 
 extern phys_addr_t tdg_shared_mask(void);
+extern int tdg_map_gpa(phys_addr_t gpa, int numpages,
+		       enum tdx_map_type map_type);
 
 #else // !CONFIG_INTEL_TDX_GUEST
 
@@ -155,6 +162,12 @@ static inline phys_addr_t tdg_shared_mask(void)
 {
 	return 0;
 }
+
+static inline int tdg_map_gpa(phys_addr_t gpa, int numpages,
+			      enum tdx_map_type map_type)
+{
+	return -ENODEV;
+}
 #endif /* CONFIG_INTEL_TDX_GUEST */
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 7e391cd7aa2b..074136473011 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -15,6 +15,8 @@
 #include "tdx-kvm.c"
 #endif
 
+#define TDVMCALL_MAP_GPA	0x10001
+
 static struct {
 	unsigned int gpa_width;
 	unsigned long attributes;
@@ -98,6 +100,17 @@ static void tdg_get_info(void)
 	physical_mask &= ~tdg_shared_mask();
 }
 
+int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+{
+	u64 ret;
+
+	if (map_type == TDX_MAP_SHARED)
+		gpa |= tdg_shared_mask();
+
+	ret = tdvmcall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE * numpages, 0, 0);
+	return ret ? -EIO : 0;
+}
+
 static __cpuidle void tdg_halt(void)
 {
 	u64 ret;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 30/32] x86/tdx: Make DMA pages shared
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (28 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 29/32] x86/tdx: Add helper to do MapGPA TDVMALL Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-18  1:19   ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 31/32] x86/kvm: Use bounce buffers for TD guest Kuppuswamy Sathyanarayanan
                   ` (2 subsequent siblings)
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Kai Huang,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Make force_dma_unencrypted() return true for TDX to get DMA pages mapped
as shared.

__set_memory_enc_dec() (renamed to __set_memory_protect() here) is now
aware of TDX and sets the Shared bit accordingly, followed by the
relevant TDVMCALL.

Also, do TDACCEPTPAGE on every 4k page after mapping the GPA range when
converting memory to private.  If the VMM uses a common pool for private
and shared memory, it will likely do TDAUGPAGE in response to MAP_GPA
(or on the first access to the private GPA), in which case the TDX module
will hold the page in a non-present "pending" state until it is
explicitly accepted.

BUG() if TDACCEPTPAGE fails for any reason other than the page having
already been accepted (TDX_PAGE_ALREADY_ACCEPTED), as the guest is
completely hosed if it can't access memory.
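
The per-page acceptance logic can be sketched in user space as follows.
The status value is copied from this patch; the sketch_* names, the 4k
page size constant and the alignment assert are assumptions of the
sketch. It only classifies a status and computes per-page addresses; it
does not issue a TDCALL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE		4096ULL
#define SKETCH_PAGE_ALREADY_ACCEPTED	0x8000000000000001ULL

/* True for the statuses on which the patch would BUG(). */
static bool sketch_accept_failed(uint64_t status)
{
	return status != 0 && status != SKETCH_PAGE_ALREADY_ACCEPTED;
}

/* Address of the i-th 4k page in a range starting at gpa. */
static uint64_t sketch_nth_page_gpa(uint64_t gpa, unsigned int i)
{
	/* Sketch-only precondition: the range starts on a page boundary. */
	assert(gpa % SKETCH_PAGE_SIZE == 0);
	return gpa + (uint64_t)i * SKETCH_PAGE_SIZE;
}
```

A caller modelling the loop in tdg_map_gpa() would walk i from 0 to
numpages - 1, accept the page at sketch_nth_page_gpa(gpa, i), and treat
only sketch_accept_failed() statuses as fatal.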

Tested-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/tdx.h       |  3 ++
 arch/x86/kernel/tdx.c            | 26 ++++++++++++++++-
 arch/x86/mm/mem_encrypt_common.c |  4 +--
 arch/x86/mm/pat/set_memory.c     | 48 ++++++++++++++++++++++++++------
 4 files changed, 70 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4789798d7737..2794bf71e45c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -19,6 +19,9 @@ enum tdx_map_type {
 
 #define TDINFO			1
 #define TDGETVEINFO		3
+#define TDACCEPTPAGE		6
+
+#define TDX_PAGE_ALREADY_ACCEPTED	0x8000000000000001
 
 struct tdcall_output {
 	u64 rcx;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 074136473011..44dd12c693d0 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -100,7 +100,8 @@ static void tdg_get_info(void)
 	physical_mask &= ~tdg_shared_mask();
 }
 
-int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+static int __tdg_map_gpa(phys_addr_t gpa, int numpages,
+			 enum tdx_map_type map_type)
 {
 	u64 ret;
 
@@ -111,6 +112,29 @@ int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
 	return ret ? -EIO : 0;
 }
 
+static void tdg_accept_page(phys_addr_t gpa)
+{
+	u64 ret;
+
+	ret = __tdcall(TDACCEPTPAGE, gpa, 0, 0, 0, NULL);
+
+	BUG_ON(ret && ret != TDX_PAGE_ALREADY_ACCEPTED);
+}
+
+int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+{
+	int ret, i;
+
+	ret = __tdg_map_gpa(gpa, numpages, map_type);
+	if (ret || map_type == TDX_MAP_SHARED)
+		return ret;
+
+	for (i = 0; i < numpages; i++)
+		tdg_accept_page(gpa + i*PAGE_SIZE);
+
+	return 0;
+}
+
 static __cpuidle void tdg_halt(void)
 {
 	u64 ret;
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index 964e04152417..b6d93b0c5dcf 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -15,9 +15,9 @@
 bool force_dma_unencrypted(struct device *dev)
 {
 	/*
-	 * For SEV, all DMA must be to unencrypted/shared addresses.
+	 * For SEV and TDX, all DMA must be to unencrypted/shared addresses.
 	 */
-	if (sev_active())
+	if (sev_active() || is_tdx_guest())
 		return true;
 
 	/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..ea78c7907847 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -27,6 +27,7 @@
 #include <asm/proto.h>
 #include <asm/memtype.h>
 #include <asm/set_memory.h>
+#include <asm/tdx.h>
 
 #include "../mm_internal.h"
 
@@ -1972,13 +1973,15 @@ int set_memory_global(unsigned long addr, int numpages)
 				    __pgprot(_PAGE_GLOBAL), 0);
 }
 
-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
 {
+	pgprot_t mem_protected_bits, mem_plain_bits;
 	struct cpa_data cpa;
+	enum tdx_map_type map_type;
 	int ret;
 
-	/* Nothing to do if memory encryption is not active */
-	if (!mem_encrypt_active())
+	/* Nothing to do if memory encryption and TDX are not active */
+	if (!mem_encrypt_active() && !is_tdx_guest())
 		return 0;
 
 	/* Should not be working on unaligned addresses */
@@ -1988,8 +1991,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	memset(&cpa, 0, sizeof(cpa));
 	cpa.vaddr = &addr;
 	cpa.numpages = numpages;
-	cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
-	cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+	if (is_tdx_guest()) {
+		mem_protected_bits = __pgprot(0);
+		mem_plain_bits = __pgprot(tdg_shared_mask());
+	} else {
+		mem_protected_bits = __pgprot(_PAGE_ENC);
+		mem_plain_bits = __pgprot(0);
+	}
+
+	if (protect) {
+		cpa.mask_set = mem_protected_bits;
+		cpa.mask_clr = mem_plain_bits;
+		map_type = TDX_MAP_PRIVATE;
+	} else {
+		cpa.mask_set = mem_plain_bits;
+		cpa.mask_clr = mem_protected_bits;
+		map_type = TDX_MAP_SHARED;
+	}
+
 	cpa.pgd = init_mm.pgd;
 
 	/* Must avoid aliasing mappings in the highmem code */
@@ -1998,8 +2018,16 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 
 	/*
 	 * Before changing the encryption attribute, we need to flush caches.
+	 *
+	 * For TDX we need to flush caches on private->shared. VMM is
+	 * responsible for flushing on shared->private.
 	 */
-	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	if (is_tdx_guest()) {
+		if (map_type == TDX_MAP_SHARED)
+			cpa_flush(&cpa, 1);
+	} else {
+		cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	}
 
 	ret = __change_page_attr_set_clr(&cpa, 1);
 
@@ -2012,18 +2040,22 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	 */
 	cpa_flush(&cpa, 0);
 
+	if (!ret && is_tdx_guest()) {
+		ret = tdg_map_gpa(__pa(addr), numpages, map_type);
+	}
+
 	return ret;
 }
 
 int set_memory_encrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, true);
+	return __set_memory_protect(addr, numpages, true);
 }
 EXPORT_SYMBOL_GPL(set_memory_encrypted);
 
 int set_memory_decrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, false);
+	return __set_memory_protect(addr, numpages, false);
 }
 EXPORT_SYMBOL_GPL(set_memory_decrypted);
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 31/32] x86/kvm: Use bounce buffers for TD guest
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (29 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 30/32] x86/tdx: Make DMA pages shared Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-06-01  2:03   ` [RFC v2-fix-v1 1/1] " Kuppuswamy Sathyanarayanan
  2021-04-26 18:01 ` [RFC v2 32/32] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kuppuswamy Sathyanarayanan
  2021-05-03 23:21 ` [RFC v2 00/32] Add TDX Guest Support Kuppuswamy, Sathyanarayanan
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

TDX doesn't allow DMA access to guest private memory. In order for
DMA to work properly in a TD guest, use SWIOTLB bounce buffers.

Move the AMD SEV initialization into common code and adapt it for TDX.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/io.h        |  3 +-
 arch/x86/kernel/pci-swiotlb.c    |  2 +-
 arch/x86/kernel/tdx.c            |  3 ++
 arch/x86/mm/mem_encrypt.c        | 45 ------------------------------
 arch/x86/mm/mem_encrypt_common.c | 47 ++++++++++++++++++++++++++++++++
 5 files changed, 53 insertions(+), 47 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 30a3b30395ad..658d9c2c2a9a 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -257,10 +257,11 @@ static inline void slow_down_io(void)
 
 #endif
 
+extern struct static_key_false sev_enable_key;
+
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 #include <linux/jump_label.h>
 
-extern struct static_key_false sev_enable_key;
 static inline bool sev_key_active(void)
 {
 	return static_branch_unlikely(&sev_enable_key);
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..020e13749758 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -49,7 +49,7 @@ int __init pci_swiotlb_detect_4gb(void)
 	 * buffers are allocated and used for devices that do not support
 	 * the addressing range required for the encryption mask.
 	 */
-	if (sme_active())
+	if (sme_active() || is_tdx_guest())
 		swiotlb = 1;
 
 	return swiotlb;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 44dd12c693d0..6b07e7b4a69c 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -8,6 +8,7 @@
 #include <asm/vmx.h>
 #include <asm/insn.h>
 #include <linux/sched/signal.h> /* force_sig_fault() */
+#include <linux/swiotlb.h>
 
 #include <linux/cpu.h>
 
@@ -470,6 +471,8 @@ void __init tdx_early_init(void)
 
 	legacy_pic = &null_legacy_pic;
 
+	swiotlb_force = SWIOTLB_FORCE;
+
 	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "tdg:cpu_hotplug",
 			  NULL, tdg_cpu_offline_prepare);
 
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 6f713c6a32b2..761a98904aa2 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -409,48 +409,3 @@ void __init mem_encrypt_free_decrypted_mem(void)
 
 	free_init_pages("unused decrypted", vaddr, vaddr_end);
 }
-
-static void print_mem_encrypt_feature_info(void)
-{
-	pr_info("AMD Memory Encryption Features active:");
-
-	/* Secure Memory Encryption */
-	if (sme_active()) {
-		/*
-		 * SME is mutually exclusive with any of the SEV
-		 * features below.
-		 */
-		pr_cont(" SME\n");
-		return;
-	}
-
-	/* Secure Encrypted Virtualization */
-	if (sev_active())
-		pr_cont(" SEV");
-
-	/* Encrypted Register State */
-	if (sev_es_active())
-		pr_cont(" SEV-ES");
-
-	pr_cont("\n");
-}
-
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void)
-{
-	if (!sme_me_mask)
-		return;
-
-	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
-	swiotlb_update_mem_attributes();
-
-	/*
-	 * With SEV, we need to unroll the rep string I/O instructions,
-	 * but SEV-ES supports them through the #VC handler.
-	 */
-	if (sev_active() && !sev_es_active())
-		static_branch_enable(&sev_enable_key);
-
-	print_mem_encrypt_feature_info();
-}
-
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
index b6d93b0c5dcf..625c15fa92f9 100644
--- a/arch/x86/mm/mem_encrypt_common.c
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -10,6 +10,7 @@
 #include <linux/mm.h>
 #include <linux/mem_encrypt.h>
 #include <linux/dma-mapping.h>
+#include <linux/swiotlb.h>
 
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
@@ -36,3 +37,49 @@ bool force_dma_unencrypted(struct device *dev)
 
 	return false;
 }
+
+static void print_amd_mem_encrypt_feature_info(void)
+{
+	pr_info("AMD Memory Encryption Features active:");
+
+	/* Secure Memory Encryption */
+	if (sme_active()) {
+		/*
+		 * SME is mutually exclusive with any of the SEV
+		 * features below.
+		 */
+		pr_cont(" SME\n");
+		return;
+	}
+
+	/* Secure Encrypted Virtualization */
+	if (sev_active())
+		pr_cont(" SEV");
+
+	/* Encrypted Register State */
+	if (sev_es_active())
+		pr_cont(" SEV-ES");
+
+	pr_cont("\n");
+}
+
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void)
+{
+	if (!sme_me_mask && !is_tdx_guest())
+		return;
+
+	/* Call into SWIOTLB to update the SWIOTLB DMA buffers */
+	swiotlb_update_mem_attributes();
+
+	/*
+	 * With SEV, we need to unroll the rep string I/O instructions,
+	 * but SEV-ES supports them through the #VC handler.
+	 */
+	if (sev_active() && !sev_es_active())
+		static_branch_enable(&sev_enable_key);
+
+	/* sme_me_mask !=0 means SME or SEV */
+	if (sme_me_mask)
+		print_amd_mem_encrypt_feature_info();
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2 32/32] x86/tdx: ioapic: Add shared bit for IOAPIC base address
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (30 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 31/32] x86/kvm: Use bounce buffers for TD guest Kuppuswamy Sathyanarayanan
@ 2021-04-26 18:01 ` Kuppuswamy Sathyanarayanan
  2021-05-07 23:06   ` Dave Hansen
  2021-05-03 23:21 ` [RFC v2 00/32] Add TDX Guest Support Kuppuswamy, Sathyanarayanan
  32 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-26 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata,
	Kuppuswamy Sathyanarayanan

From: Isaku Yamahata <isaku.yamahata@intel.com>

The IOAPIC is emulated by KVM, which means its MMIO address is shared
with the host. Add the shared bit to the IOAPIC base address.
Most MMIO regions are handled by ioremap(), which already marks them
as shared on the TDX guest platform, but the IOAPIC is an exception
because it uses a fixmap.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/kernel/apic/io_apic.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 73ff4dd426a8..2a01d4a82be7 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2675,6 +2675,14 @@ static struct resource * __init ioapic_setup_resources(void)
 	return res;
 }
 
+static void io_apic_set_fixmap_nocache(enum fixed_addresses idx, phys_addr_t phys)
+{
+	pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+	if (is_tdx_guest())
+		flags = pgprot_tdg_shared(flags);
+	__set_fixmap(idx, phys, flags);
+}
+
 void __init io_apic_init_mappings(void)
 {
 	unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
@@ -2707,7 +2715,7 @@ void __init io_apic_init_mappings(void)
 				      __func__, PAGE_SIZE, PAGE_SIZE);
 			ioapic_phys = __pa(ioapic_phys);
 		}
-		set_fixmap_nocache(idx, ioapic_phys);
+		io_apic_set_fixmap_nocache(idx, ioapic_phys);
 		apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
 			__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
 			ioapic_phys);
@@ -2836,7 +2844,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
 	ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
 	ioapics[idx].mp_config.apicaddr = address;
 
-	set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+	io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
 	if (bad_ioapic_register(idx)) {
 		clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
 		return -ENODEV;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-26 18:01 ` [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions Kuppuswamy Sathyanarayanan
@ 2021-04-26 20:32   ` Dave Hansen
  2021-04-26 22:31     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-04-26 20:32 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

> +/*
> + * Expose registers R10-R15 to VMM (for bitfield info
> + * refer to TDX GHCI specification).
> + */
> +#define TDVMCALL_EXPOSE_REGS_MASK	0xfc00

Why can't we do:

#define TDX_R10	BIT(18)
#define TDX_R11	BIT(19)

and:

#define TDVMCALL_EXPOSE_REGS_MASK	(TDX_R10 | TDX_R11 | TDX_R12 ...

or at least:

#define TDVMCALL_EXPOSE_REGS_MASK	BIT(18) | BIT(19) ...

?
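
For illustration, a minimal sketch of the named-bit variant. It assumes
that bit N of the mask selects register RN, which is what the 0xfc00
literal for R10-R15 implies. BIT() is redefined here only so the snippet
builds outside the kernel, and the TDX_Rxx names are placeholders.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's BIT() macro. */
#define BIT(n)		(UINT64_C(1) << (n))

/* Assumption: bit N of the RCX mask selects register RN. */
#define TDX_R10		BIT(10)
#define TDX_R11		BIT(11)
#define TDX_R12		BIT(12)
#define TDX_R13		BIT(13)
#define TDX_R14		BIT(14)
#define TDX_R15		BIT(15)

/* Registers R10-R15 are exposed to the VMM. */
#define TDVMCALL_EXPOSE_REGS_MASK \
	(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14 | TDX_R15)

/* The named bits must add up to the literal used in the patch. */
static_assert(TDVMCALL_EXPOSE_REGS_MASK == 0xfc00,
	      "named bits must match the original literal");
```

The compile-time check documents the equivalence with the original
literal, so a reader does not have to decode the hex value by hand.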

> +/*
> + * TDX guests use the TDCALL instruction to make
> + * hypercalls to the VMM. It is supported in
> + * Binutils >= 2.36.
> + */
> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
> +
> +/*
> + * __tdcall()  - Used to communicate with the TDX module

Why is this function here?  What does it do?  Why do we need it?

I'd like this to actually talk about doing impedance matching between
the function call and TDCALL ABIs.

> + * @arg1 (RDI) - TDCALL Leaf ID
> + * @arg2 (RSI) - Input parameter 1 passed to TDX module
> + *               via register RCX
> + * @arg2 (RDX) - Input parameter 2 passed to TDX module
> + *               via register RDX
> + * @arg3 (RCX) - Input parameter 3 passed to TDX module
> + *               via register R8
> + * @arg4 (R8)  - Input parameter 4 passed to TDX module
> + *               via register R9

The unnecessary repetition and verbosity actually make this harder to
read.  This looks like it was easy to write, but not much effort is
being made to make it easy to consume.  Could you please apply some
consideration to making it more readable?


> + * @arg5 (R9)  - struct tdcall_output pointer
> + *
> + * @out        - Return status of tdcall via RAX.

Don't comments usually just say "returns ... foo"?  Also, the @params
usually refer to *REAL* variable names.  Where the heck does "out" come
from?  Why are you even putting argX?  Shouldn't these @'s be their
literal function argument names?

	@rdi - Input parameter, moved to RCX

> + * NOTE: This function should only used for non TDVMCALL
> + *       use cases
> + */
> +SYM_FUNC_START(__tdcall)
> +	FRAME_BEGIN
> +
> +	/* Save non-volatile GPRs that are exposed to the VMM. */
> +	push %r15
> +	push %r14
> +	push %r13
> +	push %r12

Why do we have to save these?  Because they might be clobbered?  If so,
let's say *THAT* instead of just "exposed".  "Exposed" could mean "VMM
can read".

Also, this just told me that this function can't be used to talk to the
VMM.  Why is this talking about exposure to the VMM?

> +	/* Move TDCALL Leaf ID to RAX */
> +	mov %rdi, %rax
> +	/* Move output pointer to R12 */
> +	mov %r9, %r12

I thought 'struct tdcall_output' was a purely software construct.  Why
are we passing a pointer to it into TDCALL?

> +	/* Move input param 4 to R9 */
> +	mov %r8, %r9
> +	/* Move input param 3 to R8 */
> +	mov %rcx, %r8
> +	/* Leave input param 2 in RDX */
> +	/* Move input param 1 to RCX */
> +	mov %rsi, %rcx

With a little work, this can be made a *LOT* more readable:

	/* Mangle function call ABI into TDCALL ABI: */
	mov %rdi, %rax	/* Move TDCALL Leaf ID to RAX */
	mov %r9,  %r12 	/* Move output pointer to R12 */
	mov %r8,  %r9	/* Move input 4 to R9 */
	mov %rcx, %r8	/* Move input 3 to R8 */
	mov %rsi, %rcx	/* Move input 1 to RCX */
	/* Leave input param 2 in RDX */
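
The order of those moves matters, because several source registers are
also destinations. A rough userspace model of the shuffle, with a
hypothetical 'struct regs' standing in for the GPRs, makes the
dependency easy to check:

```c
#include <stdint.h>

/* Hypothetical model of the GPRs that the shuffle touches. */
struct regs {
	uint64_t rax, rcx, rdx, rsi, rdi, r8, r9, r12;
};

/*
 * Mirror of the mov sequence: every destination is overwritten only
 * after its old value has been copied to where it needs to go.
 */
static void sysv_to_tdcall(struct regs *r)
{
	r->rax = r->rdi;	/* TDCALL leaf ID */
	r->r12 = r->r9;		/* output pointer, parked in R12 */
	r->r9  = r->r8;		/* input 4 */
	r->r8  = r->rcx;	/* input 3 */
	r->rcx = r->rsi;	/* input 1 */
	/* input 2 stays in RDX */
}
```

Swapping, say, the R9 and R12 lines would overwrite the output pointer
before it is saved, which is why the sequence runs "backwards" through
the argument list.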


> +	tdcall
> +
> +	/* Check for TDCALL success: 0 - Successful, otherwise failed */
> +	test %rax, %rax
> +	jnz 1f
> +
> +	/* Check for a TDCALL output struct */
> +	test %r12, %r12
> +	jz 1f

Does some universal status come back in r12?  Aren't we dealing with a
VMM/SEAM-controlled register here?  Isn't this dangerous?

> +	/* Copy TDCALL result registers to output struct: */
> +	movq %rcx, TDCALL_rcx(%r12)
> +	movq %rdx, TDCALL_rdx(%r12)
> +	movq %r8,  TDCALL_r8(%r12)
> +	movq %r9,  TDCALL_r9(%r12)
> +	movq %r10, TDCALL_r10(%r12)
> +	movq %r11, TDCALL_r11(%r12)
> +1:
> +	/* Zero out registers exposed to the TDX Module. */
> +	xor %rcx,  %rcx
> +	xor %rdx,  %rdx
> +	xor %r8d,  %r8d
> +	xor %r9d,  %r9d
> +	xor %r10d, %r10d
> +	xor %r11d, %r11d

... why?

> +	/* Restore non-volatile GPRs that are exposed to the VMM. */
> +	pop %r12
> +	pop %r13
> +	pop %r14
> +	pop %r15
> +
> +	FRAME_END
> +	ret
> +SYM_FUNC_END(__tdcall)
> +
> +/*
> + * do_tdvmcall()  - Used to communicate with the VMM.
> + *
> + * @arg1 (RDI)    - TDVMCALL function, e.g. exit reason
> + * @arg2 (RSI)    - Input parameter 1 passed to VMM
> + *                  via register R12
> + * @arg3 (RDX)    - Input parameter 2 passed to VMM
> + *                  via register R13
> + * @arg4 (RCX)    - Input parameter 3 passed to VMM
> + *                  via register R14
> + * @arg5 (R8)     - Input parameter 4 passed to VMM
> + *                  via register R15
> + * @arg6 (R9)     - struct tdvmcall_output pointer
> + *
> + * @out           - Return status of tdvmcall(R10) via RAX.
> + *
> + */

Same comments on the sparse comment style.

> +SYM_CODE_START_LOCAL(do_tdvmcall)
> +	FRAME_BEGIN
> +
> +	/* Save non-volatile GPRs that are exposed to the VMM. */
> +	push %r15
> +	push %r14
> +	push %r13
> +	push %r12
> +
> +	/* Set TDCALL leaf ID to TDVMCALL (0) in RAX */

I think there needs to be some discussion of what TDCALL and TDVMCALL
are.  They are named too similarly not to do so.

> +	xor %eax, %eax
> +	/* Move TDVMCALL function id (1st argument) to R11 */
> +	mov %rdi, %r11
> +	/* Move Input parameter 1-4 to R12-R15 */
> +	mov %rsi, %r12
> +	mov %rdx, %r13
> +	mov %rcx, %r14
> +	mov %r8,  %r15
> +	/* Leave tdvmcall output pointer in R9 */
> +
> +	/*
> +	 * Value of RCX is used by the TDX Module to determine which
> +	 * registers are exposed to VMM. Each bit in RCX represents a
> +	 * register id. You can find the bitmap details from TDX GHCI
> +	 * spec.
> +	 */

This doesn't belong here.  Put it along with the
TDVMCALL_EXPOSE_REGS_MASK, please.

> +	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
> +
> +	tdcall
> +
> +	/*
> +	 * Check for TDCALL success: 0 - Successful, otherwise failed.
> +	 * If failed, there is an issue with TDX Module which is fatal
> +	 * for the guest. So panic.
> +	 */
> +	test %rax, %rax
> +	jnz 2f

So, just to be clear: %RAX is under the control of the SEAM module.  The
VMM has no control over it.  Right?

Shouldn't we say that explicitly?

> +	/* Move TDVMCALL success/failure to RAX to return to user */
> +	mov %r10, %rax
> +
> +	/* Check for TDVMCALL success: 0 - Successful, otherwise failed */
> +	test %rax, %rax
> +	jnz 1f
> +
> +	/* Check for a TDVMCALL output struct */
> +	test %r9, %r9
> +	jz 1f

I'd also include a note that %r9 was neither writable nor its value
exposed to the VMM.

> +	/* Copy TDVMCALL result registers to output struct: */
> +	movq %r11, TDVMCALL_r11(%r9)
> +	movq %r12, TDVMCALL_r12(%r9)
> +	movq %r13, TDVMCALL_r13(%r9)
> +	movq %r14, TDVMCALL_r14(%r9)
> +	movq %r15, TDVMCALL_r15(%r9)
> +1:
> +	/*
> +	 * Zero out registers exposed to the VMM to avoid
> +	 * speculative execution with VMM-controlled values.
> +	 */
> +	xor %r10d, %r10d
> +	xor %r11d, %r11d
> +	xor %r12d, %r12d
> +	xor %r13d, %r13d
> +	xor %r14d, %r14d
> +	xor %r15d, %r15d
> +
> +	/* Restore non-volatile GPRs that are exposed to the VMM. */
> +	pop %r12
> +	pop %r13
> +	pop %r14
> +	pop %r15
> +
> +	FRAME_END
> +	ret
> +2:
> +	ud2
> +SYM_CODE_END(do_tdvmcall)
> +
> +/* Helper function for standard type of TDVMCALL */
> +SYM_FUNC_START(__tdvmcall)
> +	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
> +	xor %r10, %r10
> +	call do_tdvmcall
> +	retq
> +SYM_FUNC_END(__tdvmcall)

Why do we need this helper?  Why does it need to be in assembly?

> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 6a7193fead08..29c52128b9c0 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -1,8 +1,44 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /* Copyright (C) 2020 Intel Corporation */
>  
> +#define pr_fmt(fmt) "TDX: " fmt
> +
>  #include <asm/tdx.h>
>  
> +/*
> + * Wrapper for use case that checks for error code and print warning message.
> + */

This comment isn't very useful.  I can see the error check and warning
by reading the code.

> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
> +{
> +	u64 err;
> +
> +	err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
> +
> +	if (err)
> +		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
> +				    fn, err);
> +
> +	return err;
> +}
> +
> +/*
> + * Wrapper for the semi-common case where we need single output value (R11).
> + */
> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
> +{
> +
> +	struct tdvmcall_output out = {0};
> +	u64 err;
> +
> +	err = __tdvmcall(fn, r12, r13, r14, r15, &out);
> +
> +	if (err)
> +		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
> +				    fn, err);
> +
> +	return out.r11;
> +}

How do callers check for errors?  Is the error value superfluously
returned in r11 and another output register?
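
To make the concern concrete: as written, a failed call returns
out.r11 == 0, which a caller cannot tell apart from a successful call
whose result is 0. One possible shape, sketched in userspace C, returns
the status and hands R11 back through a pointer. The stub and the
changed signature are assumptions for illustration, not what the patch
does.

```c
#include <stddef.h>
#include <stdint.h>

struct tdvmcall_output {
	uint64_t r11, r12, r13, r14, r15;
};

/*
 * Hypothetical stand-in for the assembly __tdvmcall(): it fails for
 * fn == 0 and otherwise reports a made-up result in r11.
 */
static uint64_t stub_tdvmcall(uint64_t fn, uint64_t r12, uint64_t r13,
			      uint64_t r14, uint64_t r15,
			      struct tdvmcall_output *out)
{
	(void)r13; (void)r14; (void)r15;

	if (fn == 0)
		return 1;		/* simulated failure, out untouched */
	if (out)
		out->r11 = r12 + 1;	/* made-up result value */
	return 0;
}

/* Return the status; write R11 through @r11 only on success. */
static uint64_t tdvmcall_out_r11(uint64_t fn, uint64_t r12, uint64_t r13,
				 uint64_t r14, uint64_t r15, uint64_t *r11)
{
	struct tdvmcall_output out = {0};
	uint64_t err = stub_tdvmcall(fn, r12, r13, r14, r15, &out);

	if (!err)
		*r11 = out.r11;
	return err;
}
```

With this shape the caller's value is left alone on failure and the
return code always says whether it can be trusted.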


* Re: [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option
  2021-04-26 18:01 ` [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option Kuppuswamy Sathyanarayanan
@ 2021-04-26 21:09   ` Randy Dunlap
  2021-04-26 22:32     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Randy Dunlap @ 2021-04-26 21:09 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> Add INTEL_TDX_GUEST config option to selectively compile
> TDX guest support.
> 
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/Kconfig | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 6b4b682af468..932e6d759ba7 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -875,6 +875,21 @@ config ACRN_GUEST
>  	  IOT with small footprint and real-time features. More details can be
>  	  found in https://projectacrn.org/.
>  
> +config INTEL_TDX_GUEST
> +	bool "Intel Trusted Domain eXtensions Guest Support"
> +	depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
> +	depends on SECURITY
> +	select PARAVIRT_XL
> +	select X86_X2APIC
> +	select SECURITY_LOCKDOWN_LSM
> +	help
> +	  Provide support for running in a trusted domain on Intel processors
> +	  equipped with Trusted Domain eXtenstions. TDX is an new Intel

	                                                   a new Intel

> +	  technology that extends VMX and Memory Encryption with a new kind of
> +	  virtual machine guest called Trust Domain (TD). A TD is designed to
> +	  run in a CPU mode that protects the confidentiality of TD memory
> +	  contents and the TD’s CPU state from other software, including VMM.
> +
>  endif #HYPERVISOR_GUEST
>  
>  source "arch/x86/Kconfig.cpu"
> 


-- 
~Randy



* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-26 20:32   ` Dave Hansen
@ 2021-04-26 22:31     ` Kuppuswamy, Sathyanarayanan
  2021-04-26 23:17       ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-26 22:31 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 4/26/21 1:32 PM, Dave Hansen wrote:
>> +/*
>> + * Expose registers R10-R15 to VMM (for bitfield info
>> + * refer to TDX GHCI specification).
>> + */
>> +#define TDVMCALL_EXPOSE_REGS_MASK	0xfc00
> 
> Why can't we do:
> 
> #define TDX_R10	BIT(18)
> #define TDX_R11	BIT(19)
> 
> and:
> 
> #define TDVMCALL_EXPOSE_REGS_MASK	(TDX_R10 | TDX_R11 | TDX_R12 ...
> 
> or at least:
> 
> #define TDVMCALL_EXPOSE_REGS_MASK	BIT(18) | BIT(19) ...

If this is the preferred way, I will change it to use macros (TDX_Rxx).

> 
> ?
> 
>> +/*
>> + * TDX guests use the TDCALL instruction to make
>> + * hypercalls to the VMM. It is supported in
>> + * Binutils >= 2.36.
>> + */
>> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
>> +
>> +/*
>> + * __tdcall()  - Used to communicate with the TDX module
> 
> Why is this function here?  What does it do?  Why do we need it?

The __tdcall() function is used to request services from the TDX Module.
Example use cases are TDREPORT, VEINFO, TDINFO, etc.

> 
> I'd like this to actually talk about doing impedance matching between
> the function call and TDCALL ABIs.
> 
>> + * @arg1 (RDI) - TDCALL Leaf ID
>> + * @arg2 (RSI) - Input parameter 1 passed to TDX module
>> + *               via register RCX
>> + * @arg2 (RDX) - Input parameter 2 passed to TDX module
>> + *               via register RDX
>> + * @arg3 (RCX) - Input parameter 3 passed to TDX module
>> + *               via register R8
>> + * @arg4 (R8)  - Input parameter 4 passed to TDX module
>> + *               via register R9
> 
> The unnecessary repetition and verbosity actually make this harder to
> read.  This looks like it was easy to write, but not much effort is
> being made to make it easy to consume.  Could you please apply some
> consideration to making it more readable?
> 
> 
>> + * @arg5 (R9)  - struct tdcall_output pointer
>> + *
>> + * @out        - Return status of tdcall via RAX.
> 
> Don't comments usually just say "returns ... foo"?  Also, the @params
> usually refer to *REAL* variable names.  Where the heck does "out" come
> from?  Why are you even putting argX?  Shouldn't these @'s be their
> literal function argument names?

I have added this comment block to make it easier for us to understand
the register mapping between function arguments and TDCALL ABI. But I got
your point. Usage of @arg1 or @out does not comply with the function comment
standards. I will fix this in the next version.

> 
> 	@rdi - Input parameter, moved to RCX

I will use the above format to document function arguments.

> 
>> + * NOTE: This function should only used for non TDVMCALL
>> + *       use cases
>> + */
>> +SYM_FUNC_START(__tdcall)
>> +	FRAME_BEGIN
>> +
>> +	/* Save non-volatile GPRs that are exposed to the VMM. */
>> +	push %r15
>> +	push %r14
>> +	push %r13
>> +	push %r12
> 
> Why do we have to save these?  Because they might be clobbered?  If so,
> let's say *THAT* instead of just "exposed".  "Exposed" could mean "VMM
> can read".
> 
> Also, this just told me that this function can't be used to talk to the
> VMM.  Why is this talking about exposure to the VMM?

Although __tdcall() is only used to communicate with the TDX module and the
TDX module is not supposed to touch these registers, just to be on the safe
side, I have tried to save the context of registers R12-R15. Anyway, the
cycles used by these instructions are few compared to tdcall.


> 
>> +	/* Move TDCALL Leaf ID to RAX */
>> +	mov %rdi, %rax
>> +	/* Move output pointer to R12 */
>> +	mov %r9, %r12
> 
> I thought 'struct tdcall_output' was a purely software construct.  Why
> are we passing a pointer to it into TDCALL?

It's used to store the TDCALL result (RCX, RDX, R8-R11). As far as this
function is concerned, it's just a block of memory (accessed using
base address + TDCALL_r* offsets).

> 
>> +	/* Move input param 4 to R9 */
>> +	mov %r8, %r9
>> +	/* Move input param 3 to R8 */
>> +	mov %rcx, %r8
>> +	/* Leave input param 2 in RDX */
>> +	/* Move input param 1 to RCX */
>> +	mov %rsi, %rcx
> 
> With a little work, this can be made a *LOT* more readable:
> 
> 	/* Mangle function call ABI into TDCALL ABI: */
> 	mov %rdi, %rax	/* Move TDCALL Leaf ID to RAX */
> 	mov %r9,  %r12 	/* Move output pointer to R12 */
> 	mov %r8,  %r9	/* Move input 4 to R9 */
> 	mov %rcx, %r8	/* Move input 3 to R8 */
> 	mov %rsi, %rcx	/* Move input 1 to RCX */
> 	/* Leave input param 2 in RDX */

Ok. I will use your version.

> 
> 
>> +	tdcall
>> +
>> +	/* Check for TDCALL success: 0 - Successful, otherwise failed */
>> +	test %rax, %rax
>> +	jnz 1f
>> +
>> +	/* Check for a TDCALL output struct */
>> +	test %r12, %r12
>> +	jz 1f
> 
> Does some universal status come back in r12?  Aren't we dealing with a
> VMM/SEAM-controlled register here?  Isn't this dangerous?

R12 is the temporary register we have used to store the address of the
user-passed output pointer. We just check for NULL condition here. R12 will
not be used by the TDX module.

If you prefer, we can just push the output pointer to stack and get it
after we make the tdcall.

> 
>> +	/* Copy TDCALL result registers to output struct: */
>> +	movq %rcx, TDCALL_rcx(%r12)
>> +	movq %rdx, TDCALL_rdx(%r12)
>> +	movq %r8,  TDCALL_r8(%r12)
>> +	movq %r9,  TDCALL_r9(%r12)
>> +	movq %r10, TDCALL_r10(%r12)
>> +	movq %r11, TDCALL_r11(%r12)
>> +1:
>> +	/* Zero out registers exposed to the TDX Module. */
>> +	xor %rcx,  %rcx
>> +	xor %rdx,  %rdx
>> +	xor %r8d,  %r8d
>> +	xor %r9d,  %r9d
>> +	xor %r10d, %r10d
>> +	xor %r11d, %r11d
> 
> ... why?

These registers are used by the TDX Module. Why pass the stale values
back to the user? So we clear them here.

> 
>> +	/* Restore non-volatile GPRs that are exposed to the VMM. */
>> +	pop %r12
>> +	pop %r13
>> +	pop %r14
>> +	pop %r15
>> +
>> +	FRAME_END
>> +	ret
>> +SYM_FUNC_END(__tdcall)
>> +
>> +/*
>> + * do_tdvmcall()  - Used to communicate with the VMM.
>> + *
>> + * @arg1 (RDI)    - TDVMCALL function, e.g. exit reason
>> + * @arg2 (RSI)    - Input parameter 1 passed to VMM
>> + *                  via register R12
>> + * @arg3 (RDX)    - Input parameter 2 passed to VMM
>> + *                  via register R13
>> + * @arg4 (RCX)    - Input parameter 3 passed to VMM
>> + *                  via register R14
>> + * @arg5 (R8)     - Input parameter 4 passed to VMM
>> + *                  via register R15
>> + * @arg6 (R9)     - struct tdvmcall_output pointer
>> + *
>> + * @out           - Return status of tdvmcall(R10) via RAX.
>> + *
>> + */
> 
> Same comments on the sparse comment style.

will fix it similar to __tdcall().

> 
>> +SYM_CODE_START_LOCAL(do_tdvmcall)
>> +	FRAME_BEGIN
>> +
>> +	/* Save non-volatile GPRs that are exposed to the VMM. */
>> +	push %r15
>> +	push %r14
>> +	push %r13
>> +	push %r12
>> +
>> +	/* Set TDCALL leaf ID to TDVMCALL (0) in RAX */
> 
> I think there needs to be some discussion of what TDCALL and TDVMCALL
> are.  They are named too similarly not to do so.

TDVMCALL is a sub-function of TDCALL (selected by setting the RAX register
to 0). TDVMCALL is used to request services from the VMM.

> 
>> +	xor %eax, %eax
>> +	/* Move TDVMCALL function id (1st argument) to R11 */
>> +	mov %rdi, %r11
>> +	/* Move Input parameter 1-4 to R12-R15 */
>> +	mov %rsi, %r12
>> +	mov %rdx, %r13
>> +	mov %rcx, %r14
>> +	mov %r8,  %r15
>> +	/* Leave tdvmcall output pointer in R9 */
>> +
>> +	/*
>> +	 * Value of RCX is used by the TDX Module to determine which
>> +	 * registers are exposed to VMM. Each bit in RCX represents a
>> +	 * register id. You can find the bitmap details from TDX GHCI
>> +	 * spec.
>> +	 */
> 
> This doesn't belong here.  Put it along with the
> TDVMCALL_EXPOSE_REGS_MASK, please.

Ok. I will do it.

> 
>> +	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>> +
>> +	tdcall
>> +
>> +	/*
>> +	 * Check for TDCALL success: 0 - Successful, otherwise failed.
>> +	 * If failed, there is an issue with TDX Module which is fatal
>> +	 * for the guest. So panic.
>> +	 */
>> +	test %rax, %rax
>> +	jnz 2f
> 
> So, just to be clear: %RAX is under the control of the SEAM module.  The
> VMM has no control over it.  Right?

AFAIK, VMM will not touch it.

Sean, please confirm it.

> 
> Shouldn't we say that explicitly?

I can add it to above comment.

> 
>> +	/* Move TDVMCALL success/failure to RAX to return to user */
>> +	mov %r10, %rax
>> +
>> +	/* Check for TDVMCALL success: 0 - Successful, otherwise failed */
>> +	test %rax, %rax
>> +	jnz 1f
>> +
>> +	/* Check for a TDVMCALL output struct */
>> +	test %r9, %r9
>> +	jz 1f
> 
> I'd also include a note that %r9 was neither writable nor its value
> exposed to the VMM.

will do it.

> 
>> +	/* Copy TDVMCALL result registers to output struct: */
>> +	movq %r11, TDVMCALL_r11(%r9)
>> +	movq %r12, TDVMCALL_r12(%r9)
>> +	movq %r13, TDVMCALL_r13(%r9)
>> +	movq %r14, TDVMCALL_r14(%r9)
>> +	movq %r15, TDVMCALL_r15(%r9)
>> +1:
>> +	/*
>> +	 * Zero out registers exposed to the VMM to avoid
>> +	 * speculative execution with VMM-controlled values.
>> +	 */
>> +	xor %r10d, %r10d
>> +	xor %r11d, %r11d
>> +	xor %r12d, %r12d
>> +	xor %r13d, %r13d
>> +	xor %r14d, %r14d
>> +	xor %r15d, %r15d
>> +
>> +	/* Restore non-volatile GPRs that are exposed to the VMM. */
>> +	pop %r12
>> +	pop %r13
>> +	pop %r14
>> +	pop %r15
>> +
>> +	FRAME_END
>> +	ret
>> +2:
>> +	ud2
>> +SYM_CODE_END(do_tdvmcall)
>> +
>> +/* Helper function for standard type of TDVMCALL */
>> +SYM_FUNC_START(__tdvmcall)
>> +	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
>> +	xor %r10, %r10
>> +	call do_tdvmcall
>> +	retq
>> +SYM_FUNC_END(__tdvmcall)
> 
> Why do we need this helper?  Why does it need to be in assembly?

It's simpler to do it in assembly. Also, grouping all register updates
in the same file will make it easier for us to read or debug issues. Another
reason is that we also call do_tdvmcall() from the in/out instruction use case.

> 
>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>> index 6a7193fead08..29c52128b9c0 100644
>> --- a/arch/x86/kernel/tdx.c
>> +++ b/arch/x86/kernel/tdx.c
>> @@ -1,8 +1,44 @@
>>   // SPDX-License-Identifier: GPL-2.0
>>   /* Copyright (C) 2020 Intel Corporation */
>>   
>> +#define pr_fmt(fmt) "TDX: " fmt
>> +
>>   #include <asm/tdx.h>
>>   
>> +/*
>> + * Wrapper for use case that checks for error code and print warning message.
>> + */
> 
> This comment isn't very useful.  I can see the error check and warning
> by reading the code.

It's just a helper function that covers the common case of checking for an
error and printing a warning message. If this comment is superfluous, I can remove
it.

> 
>> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>> +{
>> +	u64 err;
>> +
>> +	err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
>> +
>> +	if (err)
>> +		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>> +				    fn, err);
>> +
>> +	return err;
>> +}
>> +
>> +/*
>> + * Wrapper for the semi-common case where we need single output value (R11).
>> + */
>> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>> +{
>> +
>> +	struct tdvmcall_output out = {0};
>> +	u64 err;
>> +
>> +	err = __tdvmcall(fn, r12, r13, r14, r15, &out);
>> +
>> +	if (err)
>> +		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>> +				    fn, err);
>> +
>> +	return out.r11;
>> +}
> 
> How do callers check for errors?  Is the error value superfluously
> returned in r11 and another output register?

We already check for errors in this helper function. Users of this function
only care about the output value (R11), mainly for the in/out use case.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2 02/32] x86/tdx: Introduce INTEL_TDX_GUEST config option
  2021-04-26 21:09   ` Randy Dunlap
@ 2021-04-26 22:32     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-26 22:32 UTC (permalink / raw)
  To: Randy Dunlap, Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 4/26/21 2:09 PM, Randy Dunlap wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>> Add INTEL_TDX_GUEST config option to selectively compile
>> TDX guest support.
>>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>> Reviewed-by: Tony Luck <tony.luck@intel.com>
>> ---
>>   arch/x86/Kconfig | 15 +++++++++++++++
>>   1 file changed, 15 insertions(+)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 6b4b682af468..932e6d759ba7 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -875,6 +875,21 @@ config ACRN_GUEST
>>   	  IOT with small footprint and real-time features. More details can be
>>   	  found in https://projectacrn.org/.
>>   
>> +config INTEL_TDX_GUEST
>> +	bool "Intel Trusted Domain eXtensions Guest Support"
>> +	depends on X86_64 && CPU_SUP_INTEL && PARAVIRT
>> +	depends on SECURITY
>> +	select PARAVIRT_XL
>> +	select X86_X2APIC
>> +	select SECURITY_LOCKDOWN_LSM
>> +	help
>> +	  Provide support for running in a trusted domain on Intel processors
>> +	  equipped with Trusted Domain eXtenstions. TDX is an new Intel
> 
> 	                                                   a new Intel
> 

Good catch. I will fix it in the next version.

>> +	  technology that extends VMX and Memory Encryption with a new kind of
>> +	  virtual machine guest called Trust Domain (TD). A TD is designed to
>> +	  run in a CPU mode that protects the confidentiality of TD memory
>> +	  contents and the TD’s CPU state from other software, including VMM.
>> +
>>   endif #HYPERVISOR_GUEST
>>   
>>   source "arch/x86/Kconfig.cpu"
>>
> 
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-26 22:31     ` Kuppuswamy, Sathyanarayanan
@ 2021-04-26 23:17       ` Dave Hansen
  2021-04-27  2:29         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-04-26 23:17 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 3:31 PM, Kuppuswamy, Sathyanarayanan wrote:
>>> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
>>> +
>>> +/*
>>> + * __tdcall()  - Used to communicate with the TDX module
>>
>> Why is this function here?  What does it do?  Why do we need it?
> 
> The __tdcall() function is used to request services from the TDX Module.
> Example use cases are TDREPORT, VEINFO, TDINFO, etc.

I think there might be some misinterpretation of my question.  What you
are describing is what *TDCALL* does.  Why do we need a wrapper
function?  What purpose does this wrapper function serve?  Why do we
need this wrapper function?

>>> + * NOTE: This function should only used for non TDVMCALL
>>> + *       use cases
>>> + */
>>> +SYM_FUNC_START(__tdcall)
>>> +    FRAME_BEGIN
>>> +
>>> +    /* Save non-volatile GPRs that are exposed to the VMM. */
>>> +    push %r15
>>> +    push %r14
>>> +    push %r13
>>> +    push %r12
>>
>> Why do we have to save these?  Because they might be clobbered?  If so,
>> let's say *THAT* instead of just "exposed".  "Exposed" could mean "VMM
>> can read".
>>
>> Also, this just told me that this function can't be used to talk to the
>> VMM.  Why is this talking about exposure to the VMM?
> 
> Although __tdcall() is only used to communicate with the TDX module and the
> TDX module is not supposed to touch these registers, just to be on the safe
> side, I have tried to save the context of registers R12-R15. Anyway, the
> cycles used by these instructions are few compared to tdcall.

Why are you talking about the VMM if this is a call to the SEAM module?

Let's say someone is reading the TDCALL architecture spec.  It will say
something like, "blah blah, in this case TDCALL will not modify
%r12->%r15".  Then someone goes and looks at this code that basically
says (or implies) "save these before the SEAM module modifies them".
What is a coder to do?

Please remove the ambiguity, either by removing this superfluous
(according to the spec) code, or documenting why it is not superfluous.

>>> +    /* Move TDCALL Leaf ID to RAX */
>>> +    mov %rdi, %rax
>>> +    /* Move output pointer to R12 */
>>> +    mov %r9, %r12
>>
>> I thought 'struct tdcall_output' was a purely software construct.  Why
>> are we passing a pointer to it into TDCALL?
> 
> It's used to store the TDCALL result (RCX, RDX, R8-R11). As far as this
> function is concerned, it's just a block of memory (accessed using
> base address + TDCALL_r* offsets).

Is 'struct tdcall_output' a hardware architectural structure or a
software structure?

If it's a software structure, then why are we passing a pointer to a
software structure into a hardware ABI?

If it's a hardware architecture structure, where is the documentation
for it?

>>> +    tdcall
>>> +
>>> +    /* Check for TDCALL success: 0 - Successful, otherwise failed */
>>> +    test %rax, %rax
>>> +    jnz 1f
>>> +
>>> +    /* Check for a TDCALL output struct */
>>> +    test %r12, %r12
>>> +    jz 1f
>>
>> Does some universal status come back in r12?  Aren't we dealing with a
>> VMM/SEAM-controlled register here?  Isn't this dangerous?
> 
> R12 is the temporary register we have used to store the address of the
> user-passed output pointer. We just check for NULL condition here. R12 will
> not be used by the TDX module.

OK, so how do you know this?  Could you share your logic, please?

> If you prefer, we can just push the output pointer to stack and get it
> after we make the tdcall.

I prefer that the code be understandable and be written for a clear
purpose.  If you're using r12 for temporary storage, I expect to see at
least one reference *SOMEWHERE* to its use as temporary storage.  Right
now.... nothing.

>>> +    /* Copy TDCALL result registers to output struct: */
>>> +    movq %rcx, TDCALL_rcx(%r12)
>>> +    movq %rdx, TDCALL_rdx(%r12)
>>> +    movq %r8,  TDCALL_r8(%r12)
>>> +    movq %r9,  TDCALL_r9(%r12)
>>> +    movq %r10, TDCALL_r10(%r12)
>>> +    movq %r11, TDCALL_r11(%r12)
>>> +1:
>>> +    /* Zero out registers exposed to the TDX Module. */
>>> +    xor %rcx,  %rcx
>>> +    xor %rdx,  %rdx
>>> +    xor %r8d,  %r8d
>>> +    xor %r9d,  %r9d
>>> +    xor %r10d, %r10d
>>> +    xor %r11d, %r11d
>>
>> ... why?
> 
> These registers are used by the TDX Module. Why pass the stale values
> back to the user? So we clear them here.

Please go look at some other assembly code in the kernel called from C.
Do those functions do this?  Why?  Why not?  Do they care about
"passing stale values back up"?

>>> +SYM_CODE_START_LOCAL(do_tdvmcall)
>>> +    FRAME_BEGIN
>>> +
>>> +    /* Save non-volatile GPRs that are exposed to the VMM. */
>>> +    push %r15
>>> +    push %r14
>>> +    push %r13
>>> +    push %r12
>>> +
>>> +    /* Set TDCALL leaf ID to TDVMCALL (0) in RAX */
>>
>> I think there needs to be some discussion of what TDCALL and TDVMCALL
>> are.  They are named too similarly not to do so.
> 
> TDVMCALL is a sub-function of TDCALL (selected by setting the RAX register
> to 0). TDVMCALL is used to request services from the VMM.

Actually, I think these functions are horribly misnamed.

I think we should make them

	__tdx_seam_call()
or	__tdx_module_call()

and

	__tdx_hypercall()


	__tdcall()
and
	__tdvmcall()

are really nonsensical in this context, especially since TDVMCALL is
implemented with the TDCALL instruction, but not the __tdcall() function.

>>> +/* Helper function for standard type of TDVMCALL */
>>> +SYM_FUNC_START(__tdvmcall)
>>> +    /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
>>> +    xor %r10, %r10
>>> +    call do_tdvmcall
>>> +    retq
>>> +SYM_FUNC_END(__tdvmcall)
>>
>> Why do we need this helper?  Why does it need to be in assembly?
> 
> It's simpler to do it in assembly. Also, grouping all register updates
> in the same file will make it easier for us to read or debug issues.
> Another reason is that we also call do_tdvmcall() from the in/out
> instruction use case.

Sathya, I seem to have to reverse-engineer what you are doing for all
this stuff.  Your answers to my questions are almost entirely orthogonal
to the things I really want to know.  I guess I need to be more precise
with the questions I'm asking.  But, this is yet another case where I
think the burden for this series continues to fall on the reviewer
rather than the submitter.  Not the way I think it is best.

So, trying to reverse-engineer what you are doing here... it seems that
you can't *practically* call do_tdvmcall() directly because %r10 would
be garbage.  That makes this (or a wrapper like it) required for every
practical call to do_tdvmcall().

But, even if that's the case, you need to *DOCUMENT* that up in
do_tdvmcall(): Hey, this function is worthless without something that
sets up %r10 before calling it.

I'm also not *SURE* this is simpler to do in assembly.

>>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>>> index 6a7193fead08..29c52128b9c0 100644
>>> --- a/arch/x86/kernel/tdx.c
>>> +++ b/arch/x86/kernel/tdx.c
>>> @@ -1,8 +1,44 @@
>>>   // SPDX-License-Identifier: GPL-2.0
>>>   /* Copyright (C) 2020 Intel Corporation */
>>>   +#define pr_fmt(fmt) "TDX: " fmt
>>> +
>>>   #include <asm/tdx.h>
>>>   +/*
>>> + * Wrapper for use case that checks for error code and print warning message.
>>> + */
>>
>> This comment isn't very useful.  I can see the error check and warning
>> by reading the code.
> 
> Its just a helper function that covers common case of checking for error
> and print the warning message. If this comment is superfluous, I can remove
> it.

I'd prefer that you actually write a comment about what the function is
doing, maybe:

/*
 * Wrapper for simple hypercalls that only return a success/error code.
 */

... or *SOMETHING* that tells what its purpose in life is.

>>> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>>> +{
>>> +    u64 err;
>>> +
>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
>>> +
>>> +    if (err)
>>> +        pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>>> +                    fn, err);
>>> +
>>> +    return err;
>>> +}
>>> +
>>> +/*
>>> + * Wrapper for the semi-common case where we need single output value (R11).
>>> + */
>>> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>>> +{
>>> +
>>> +    struct tdvmcall_output out = {0};
>>> +    u64 err;
>>> +
>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, &out);
>>> +
>>> +    if (err)
>>> +        pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>>> +                    fn, err);
>>> +
>>> +    return out.r11;
>>> +}
>>
>> How do callers check for errors?  Is the error value superfluously
>> returned in r11 and another output register?
> 
> We already check for error in this helper function. User of this function
> only cares about output value (R11). Mainly for in/out use case.

That's pretty valuable information.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-26 23:17       ` Dave Hansen
@ 2021-04-27  2:29         ` Kuppuswamy, Sathyanarayanan
  2021-04-27 14:29           ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-27  2:29 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 4/26/21 4:17 PM, Dave Hansen wrote:
> On 4/26/21 3:31 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
>>>> +
>>>> +/*
>>>> + * __tdcall()  - Used to communicate with the TDX module
>>>
>>> Why is this function here?  What does it do?  Why do we need it?
>>
>> __tdcall() function is used to request services from the TDX Module.
>> Example use cases are, TDREPORT, VEINFO, TDINFO, etc.
> 
> I think there might be some misinterpretation of my question.  What you
> are describing is what *TDCALL* does.  Why do we need a wrapper
> function?  What purpose does this wrapper function serve?  Why do we
> need this wrapper function?
> 

How about the following explanation?

Helper function for the "tdcall" instruction, which can be used to request
services from the TDX module (but not from the VMM). A few examples of
valid TDX module services are "TDREPORT", "MEM PAGE ACCEPT", "VEINFO",
etc.

This function serves as a wrapper that moves the caller's arguments to
the correct registers, as specified by the "tdcall" ABI, and passes them
to the TDX module.  If the "tdcall" operation is successful and a
valid "struct tdcall_output" pointer is available (in the "out" argument),
the output from the TDX module (RCX, RDX, R8-R11) is saved to the memory
specified by the "out" pointer. The status of the "tdcall" operation is
returned to the caller as the function return value.
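
To make that contract concrete, here is a rough userspace sketch in C. It
is only an illustration under stated assumptions: fake_tdcall(), struct
regs, struct module_output and module_call_sketch() are hypothetical names,
and the stub's behaviour (leaf 1 succeeds, everything else fails) is made
up so that the copy-out rule can be exercised without real hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Purely a software structure: the TDX module never sees a pointer to it. */
struct module_output {
	uint64_t rcx, rdx, r8, r9, r10, r11;
};

/* Registers visible to the stub below (illustrative only). */
struct regs {
	uint64_t rax, rcx, rdx, r8, r9, r10, r11;
};

/*
 * Hypothetical stand-in for the "tdcall" instruction: leaf 1 succeeds and
 * returns each input incremented by one, any other leaf fails.
 */
static void fake_tdcall(struct regs *r)
{
	if (r->rax != 1) {
		r->rax = UINT64_MAX;	/* non-zero status means failure */
		return;
	}
	r->rcx += 1;
	r->rdx += 1;
	r->r8 += 1;
	r->r9 += 1;
	r->r10 = 0;
	r->r11 = 0;
	r->rax = 0;			/* zero status means success */
}

static uint64_t module_call_sketch(uint64_t fn, uint64_t rcx, uint64_t rdx,
				   uint64_t r8, uint64_t r9,
				   struct module_output *out)
{
	/* Move the caller's arguments into the "registers" the ABI expects. */
	struct regs r = {
		.rax = fn, .rcx = rcx, .rdx = rdx, .r8 = r8, .r9 = r9,
	};

	fake_tdcall(&r);

	/* Copy results out only on success and only if the caller wants them. */
	if (r.rax == 0 && out) {
		out->rcx = r.rcx;
		out->rdx = r.rdx;
		out->r8 = r.r8;
		out->r9 = r.r9;
		out->r10 = r.r10;
		out->r11 = r.r11;
	}

	/* The status is always the function return value. */
	return r.rax;
}
```

In this sketch the output pointer lives in an ordinary C variable, so it
never needs to be parked in a register that sits next to the call
arguments.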

>>> Why do we have to save these?  Because they might be clobbered?  If so,
>>> let's say *THAT* instead of just "exposed".  "Exposed" could mean "VMM
>>> can read".
>>>
>>> Also, this just told me that this function can't be used to talk to the
>>> VMM.  Why is this talking about exposure to the VMM?
>>
>> Although __tdcall() is only used to communicate with the TDX module and the
>> TDX module is not supposed to touch these registers, just to be on the safe
>> side, I have tried to save the context of registers R12-R15. Anyway cycles
>> used by instructions are less compared to tdcall.
> 
> Why are you talking about the VMM if this is a call to the SEAM module?
> 
> Let's say someone is reading the TDCALL architecture spec.  It will say
> something like, "blah blah, in this case TDCALL will not modify
> %r12->%r15".  Then someone goes and looks at this code that basically
> says (or implies) "save these before the SEAM module modifies them".
> What is a coder to do?
> 
> Please remove the ambiguity, either by removing this superfluous
> (according to the spec) code, or documenting why it is not superfluous.

Agree. I will remove the save/restore context code.

> 
>>>> +    /* Move TDCALL Leaf ID to RAX */
>>>> +    mov %rdi, %rax
>>>> +    /* Move output pointer to R12 */
>>>> +    mov %r9, %r12
>>>
>>> I thought 'struct tdcall_output' was a purely software construct.  Why
>>> are we passing a pointer to it into TDCALL?
>>
>> Its used to store the TDCALL result (RCX, RDX, R8-R11). As far as this
>> function is concerned, its just a block of memory (accessed using
>> base address + TDCALL_r* offsets).
> 
> Is 'struct tdcall_output' a hardware architectural structure or a
> software structure?
> 
> If it's a software structure, then why are we passing a pointer to a
> software structure into a hardware ABI?
> 
> If it's a hardware architecture structure, where is the documentation
> for it?
> 

I think there is a misunderstanding here. We don't share the tdcall_output
pointer with the TDX module. Current use cases of TDCALL (other than TDVMCALL)
do not use registers from R12-R15. Since the registers R12-R15 are free and
available, we are using R12 as temporary storage to hold the tdcall_output
pointer.

I will include some comment about using it as temporary storage.


> 
> I prefer that the code be understandable and be written for a clear
> purpose.  If you're using r12 for temporary storage, I expect to see at
> least one reference *SOMEWHERE* to its use as temporary storage.  Right
> now.... nothing.
> 

I will include some reference to it.

>>>> +    /* Copy TDCALL result registers to output struct: */
>>>> +    movq %rcx, TDCALL_rcx(%r12)
>>>> +    movq %rdx, TDCALL_rdx(%r12)
>>>> +    movq %r8,  TDCALL_r8(%r12)
>>>> +    movq %r9,  TDCALL_r9(%r12)
>>>> +    movq %r10, TDCALL_r10(%r12)
>>>> +    movq %r11, TDCALL_r11(%r12)
>>>> +1:
>>>> +    /* Zero out registers exposed to the TDX Module. */
>>>> +    xor %rcx,  %rcx
>>>> +    xor %rdx,  %rdx
>>>> +    xor %r8d,  %r8d
>>>> +    xor %r9d,  %r9d
>>>> +    xor %r10d, %r10d
>>>> +    xor %r11d, %r11d
>>>
>>> ... why?
>>
>> These registers are used by the TDX Module. Why pass the stale values
>> back to the user? So we clear them here.
> 
> Please go look at some other assembly code in the kernel called from C.
>   Do those functions do this?  Why?  Why not?  Do they care about
> "passing stale values back up"?
> 

Maybe I am being overly cautious here. Since the TDX module is trusted
code, speculation attacks are not a concern here. I will remove this
block of code.

>>>> +SYM_CODE_START_LOCAL(do_tdvmcall)
>>>> +    FRAME_BEGIN
>>>> +
>>>> +    /* Save non-volatile GPRs that are exposed to the VMM. */
>>>> +    push %r15
>>>> +    push %r14
>>>> +    push %r13
>>>> +    push %r12
>>>> +
>>>> +    /* Set TDCALL leaf ID to TDVMCALL (0) in RAX */
>>>
>>> I think there needs to be some discussion of what TDCALL and TDVMCALL
>>> are.  They are named too similarly not to do so.
>>
>> TDVMCALL is the sub function of TDCALL (selected by setting RAX register
>> to 0). TDVMCALL is used to request services from VMM.
> 
> Actually, I think these functions are horribly misnamed.
> 
> I think we should make them
> 
> 	__tdx_seam_call()
> or	__tdx_module_call()
> 
> and
> 
> 	__tdx_hypercall()
> 
> 
> 	__tdcall()
> and
> 	__tdvmcall()
> 
> are really nonsensical in this context, especially since TDVMCALL is
> implemented with the TDCALL instruction, but not the __tdcall() function.
> 

TDVMCALL is a short form of "TDG.VP.VMCALL". This usage comes from the
GHCI document. We can read it as "Trust Domain VMCALL". Maybe because
we are used to the GHCI spec, we don't find it confusing. I agree that
if you consider the "tdcall" instruction usage, it is confusing.

But if it is confusing for new readers and a rename is preferred, do we
need to rename the helper functions as well?

tdvmcall(), tdvmcall_out_r11()

Also what about output structs?

struct tdcall_output
struct tdvmcall_output

>>>> +/* Helper function for standard type of TDVMCALL */
>>>> +SYM_FUNC_START(__tdvmcall)
>>>> +    /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
>>>> +    xor %r10, %r10
>>>> +    call do_tdvmcall
>>>> +    retq
>>>> +SYM_FUNC_END(__tdvmcall)
>>>
>>> Why do we need this helper?  Why does it need to be in assembly?
>>
>> Its simpler to do it in assembly. Also, grouping all register updates
>> in the same file will make it easier for us to read or debug issues.
>> Another reason is, we also call do_tdvmcall() from in/out instruction use case.
> 
> Sathya, I seem to have to reverse-engineer what you are doing for all
> this stuff.  Your answers to my questions are almost entirely orthogonal
> to the things I really want to know.  I guess I need to be more precise
> with the questions I'm asking.  But, this is yet another case where I
> think the burden for this series continues to fall on the reviewer
> rather than the submitter.  Not the way I think it is best.

I had assumed that you were aware of the reason for the existence of
the do_tdvmcall() helper function. It exists mainly to hold the code
that is common to the vendor-specific and standard types of TDVMCALL.

But that was a mistake on my end. I will try to be more elaborate in my
future replies.

> 
> So, trying to reverse-engineer what you are doing here... it seems that
> you can't *practically* call do_tdvmcall() directly because %r10 would
> be garbage.  That makes this (or a wrapper like it) required for every
> practical call to do_tdvmcall().
> 
> But, even if that's the case, you need to *DOCUMENT* that up in
> do_tdvmcall(): Hey, this function is worthless without something that
> sets up %r10 before calling it.

Agreed. This needs to be documented. I will add it in the next version.
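
As a point of comparison only, a minimal C sketch of the same split is
shown here. It is not the code from the patch: fake_vmm(),
do_hypercall_sketch(), hypercall_sketch() and HCALL_TYPE_STANDARD are
hypothetical names, and the stub VMM (which knows a single function,
0x10, that doubles its argument) is invented for the example. The point
is that the call type that ends up in R10 becomes an explicit, documented
parameter of the common helper.

```c
#include <stdint.h>

#define HCALL_TYPE_STANDARD	0	/* R10 == 0 selects a standard call */

/*
 * Hypothetical stand-in for the VMM: it only understands the standard
 * call 0x10, which returns twice its argument.
 */
static uint64_t fake_vmm(uint64_t type, uint64_t fn, uint64_t arg,
			 uint64_t *result)
{
	if (type != HCALL_TYPE_STANDARD || fn != 0x10)
		return 1;		/* non-zero means failure */

	*result = arg * 2;
	return 0;
}

/*
 * Common code shared by standard and vendor-specific hypercalls.
 *
 * @type is what would be placed in R10: 0 for a standard call, > 0 for a
 * vendor-specific one. It is a required input; there is no default.
 */
static uint64_t do_hypercall_sketch(uint64_t type, uint64_t fn, uint64_t arg,
				    uint64_t *result)
{
	return fake_vmm(type, fn, arg, result);
}

/* Helper for the standard type: all it adds is type == 0. */
static uint64_t hypercall_sketch(uint64_t fn, uint64_t arg, uint64_t *result)
{
	return do_hypercall_sketch(HCALL_TYPE_STANDARD, fn, arg, result);
}
```

Whether a C version like this is preferable to the assembly helper is a
judgment call; the sketch only shows that the R10 requirement can be made
visible in the function signature.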

> 
> I'm also not *SURE* this is simpler to do in assembly.
> 
>>>> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
>>>> index 6a7193fead08..29c52128b9c0 100644
>>>> --- a/arch/x86/kernel/tdx.c
>>>> +++ b/arch/x86/kernel/tdx.c
>>>> @@ -1,8 +1,44 @@
>>>>    // SPDX-License-Identifier: GPL-2.0
>>>>    /* Copyright (C) 2020 Intel Corporation */
>>>>    +#define pr_fmt(fmt) "TDX: " fmt
>>>> +
>>>>    #include <asm/tdx.h>
>>>>    +/*
>>>> + * Wrapper for use case that checks for error code and print warning message.
>>>> + */
>>>
>>> This comment isn't very useful.  I can see the error check and warning
>>> by reading the code.
>>
>> Its just a helper function that covers common case of checking for error
>> and print the warning message. If this comment is superfluous, I can remove
>> it.
> 
> I'd prefer that you actually write a comment about what the function is
> doing, maybe:
> 
> /*
>   * Wrapper for simple hypercalls that only return a success/error code.
>   */
> 
> ... or *SOMETHING* that tells what its purpose in life is.

I will fix it in the next version.

> 
>>>> +static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>>>> +{
>>>> +    u64 err;
>>>> +
>>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
>>>> +
>>>> +    if (err)
>>>> +        pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>>>> +                    fn, err);
>>>> +
>>>> +    return err;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Wrapper for the semi-common case where we need single output value (R11).
>>>> + */
>>>> +static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>>>> +{
>>>> +
>>>> +    struct tdvmcall_output out = {0};
>>>> +    u64 err;
>>>> +
>>>> +    err = __tdvmcall(fn, r12, r13, r14, r15, &out);
>>>> +
>>>> +    if (err)
>>>> +        pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
>>>> +                    fn, err);
>>>> +
>>>> +    return out.r11;
>>>> +}
>>>
>>> How do callers check for errors?  Is the error value superfluously
>>> returned in r11 and another output register?
>>
>> We already check for error in this helper function. User of this function
>> only cares about output value (R11). Mainly for in/out use case.
> 
> That's pretty valuable information.

I will include this note in the function comment.
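
A rough userspace sketch of what such a documented wrapper could look
like follows. The names fake_hypercall(), struct hcall_output and
hypercall_out_r11_sketch() are hypothetical, as is the stub's behaviour
(function 0x1e returns its argument plus 0x100 in R11, everything else
fails); only the error-handling contract is taken from the discussion
above.

```c
#include <stdint.h>
#include <stdio.h>

/* Software-only output block; the field list is an assumption. */
struct hcall_output {
	uint64_t r11, r12, r13, r14, r15;
};

/*
 * Hypothetical stand-in for the low-level hypercall helper: fn 0x1e
 * succeeds and returns (r12 + 0x100) in R11, anything else fails.
 */
static uint64_t fake_hypercall(uint64_t fn, uint64_t r12,
			       struct hcall_output *out)
{
	if (fn != 0x1e)
		return 1;

	if (out)
		out->r11 = r12 + 0x100;
	return 0;
}

/*
 * Wrapper for hypercalls whose only interesting output is R11 (mainly
 * the in/out use case).
 *
 * Errors are reported here with a warning and are NOT propagated to the
 * caller. On failure the caller gets 0, because the output block is
 * zero-initialised and the stub does not touch it.
 */
static uint64_t hypercall_out_r11_sketch(uint64_t fn, uint64_t r12)
{
	struct hcall_output out = {0};
	uint64_t err;

	err = fake_hypercall(fn, r12, &out);
	if (err)
		fprintf(stderr, "hypercall fn:%llx failed with err:%llx\n",
			(unsigned long long)fn, (unsigned long long)err);

	return out.r11;
}
```

One consequence worth spelling out in the comment: in this sketch a
failed call cannot be told apart from a successful call that legitimately
returned 0.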

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-27  2:29         ` Kuppuswamy, Sathyanarayanan
@ 2021-04-27 14:29           ` Dave Hansen
  2021-04-27 19:18             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-04-27 14:29 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 7:29 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 4/26/21 4:17 PM, Dave Hansen wrote:
>> On 4/26/21 3:31 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>>> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
>>>>> +
>>>>> +/*
>>>>> + * __tdcall()  - Used to communicate with the TDX module
>>>>
>>>> Why is this function here?  What does it do?  Why do we need it?
>>>
>>> __tdcall() function is used to request services from the TDX Module.
>>> Example use cases are, TDREPORT, VEINFO, TDINFO, etc.
>>
>> I think there might be some misinterpretation of my question.  What you
>> are describing is what *TDCALL* does.  Why do we need a wrapper
>> function?  What purpose does this wrapper function serve?  Why do we
>> need this wrapper function?
>>
> How about following explanation?
> 
> Helper function for "tdcall" instruction, which can be used to request
> services from the TDX module (does not include VMM). Few examples of
> valid TDX module services are, "TDREPORT", "MEM PAGE ACCEPT", "VEINFO",
> etc.

Naming the services here is not useful.  If I want to know who calls
this, I'll just literally do that: look up the callers of this function.

> This function serves as a wrapper to move user call arguments to
> the correct registers as specified by "tdcall" ABI and shares it with
> the TDX module.  If the "tdcall" operation is successful and a
> valid "struct tdcall_out" pointer is available (in "out" argument),
> output from the TDX module (RCX, RDX, R8-R11) is saved to the memory
> specified in the "out" pointer. Also the status of the "tdcall"
> operation is returned back to the user as a function return value.

I tend to prefer function comments that talk high-level about what the
function does rather than waste space on the exact registers used in the
ABI.  I also tend not to talk about things that can be trivially grepped
for, like the callers of this function.

I'd trim the fat out of there, but it's generally OK, although too
rotund for my taste.

>>>>> +    /* Move TDCALL Leaf ID to RAX */
>>>>> +    mov %rdi, %rax
>>>>> +    /* Move output pointer to R12 */
>>>>> +    mov %r9, %r12
>>>>
>>>> I thought 'struct tdcall_output' was a purely software construct.  Why
>>>> are we passing a pointer to it into TDCALL?
>>>
>>> Its used to store the TDCALL result (RCX, RDX, R8-R11). As far as this
>>> function is concerned, its just a block of memory (accessed using
>>> base address + TDCALL_r* offsets).
>>
>> Is 'struct tdcall_output' a hardware architectural structure or a
>> software structure?
>>
>> If it's a software structure, then why are we passing a pointer to a
>> software structure into a hardware ABI?
>>
>> If it's a hardware architecture structure, where is the documentation
>> for it?
>>
> 
> I think there is a misunderstanding here. We don't share the tdcall_output
> pointer with the TDX module. Current use cases of TDCALL (other than TDVMCALL)
> do not use registers from R12-R15. Since the registers R12-R15 are free and
> available, we are using R12 as temporary storage to hold the tdcall_output
> pointer.

In other words, 'struct tdcall_output' is a purely software concept.
However, its pointer is manipulated literally next to all of the TDCALL
register arguments and it has an *IDENTICAL* comment to all of those
other moves.

Please make it clear that %r12 is not being used at all for the TDCALL
instruction itself.

But, the bigger point here is that the code needs to be structured in a
way that makes the function and interactions clear.  If I want to know
more about the "output pointer", where do I go?  Do I go looking at the
calling functions or the TDINFO instruction reference?

...
>> Please go look at some other assembly code in the kernel called from C.
>>   Do those functions do this?  Why?  Why not?  Do they care about
>> "passing stale values back up"?
> 
> Maybe I am being overly cautious here. Since TDX module is the trusted
> code, speculation attack is not a consideration here. I will remove this
> block of code.

Caution is OK.  Caution without explanation somewhere as to why it is
warranted is not.

> Do we need to rename the helper functions ?
> 
> tdvmcall(), tdvmcall_out_r11()

Yes.

> Also what about output structs?
> 
> struct tdcall_output
> struct tdvmcall_output

Yes, they need sane, straightforward names which are not confusing too.

>>>>> +/* Helper function for standard type of TDVMCALL */
>>>>> +SYM_FUNC_START(__tdvmcall)
>>>>> +    /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
>>>>> +    xor %r10, %r10
>>>>> +    call do_tdvmcall
>>>>> +    retq
>>>>> +SYM_FUNC_END(__tdvmcall)
>>>>
>>>> Why do we need this helper?  Why does it need to be in assembly?
>>>
>>> Its simpler to do it in assembly. Also, grouping all register updates
>>> in the same file will make it easier for us to read or debug issues.
>>> Another reason is, we also call do_tdvmcall() from in/out instruction use case.
>>
>> Sathya, I seem to have to reverse-engineer what you are doing for all
>> this stuff.  Your answers to my questions are almost entirely orthogonal
>> to the things I really want to know.  I guess I need to be more precise
>> with the questions I'm asking.  But, this is yet another case where I
>> think the burden for this series continues to fall on the reviewer
>> rather than the submitter.  Not the way I think it is best.
> 
> I have assumed that you are aware of reason for the existence of
> do_tdvmcall() helper function. It is mainly created to hold common
> code between vendor specific and standard type of tdvmcall's.

No, I was not aware of that.  Remember, you're not doing this for *ME*.
You're doing it for the hundred other people that are going to look
over the code and who won't have been aware of your reasoning.


* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-04-26 18:01 ` [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL Kuppuswamy Sathyanarayanan
@ 2021-04-27 17:31   ` Borislav Petkov
  2021-05-06 14:59     ` Kirill A. Shutemov
  2021-05-10  8:07     ` Juergen Gross
  0 siblings, 2 replies; 381+ messages in thread
From: Borislav Petkov @ 2021-04-27 17:31 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel, Jürgen Gross

+ Jürgen.

On Mon, Apr 26, 2021 at 11:01:28AM -0700, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Split off halt paravirt calls from CONFIG_PARAVIRT_XXL into
> a separate config option. It provides a middle ground for
> not-so-deep paravirtulized environments.

Please introduce a spellchecker into your patch creation workflow.

Also, what does "not-so-deep" mean?

> CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
> calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
> config would be a bloat for TDX.

Used how? Why is it bloat for TDX?

I'm sure that'll become clear in the remainder of the patches but you
should state it here so that it is clear why you're doing what you're
doing.

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/Kconfig                      |  4 +++
>  arch/x86/boot/compressed/misc.h       |  1 +
>  arch/x86/include/asm/irqflags.h       | 38 +++++++++++++++------------
>  arch/x86/include/asm/paravirt.h       | 22 +++++++++-------
>  arch/x86/include/asm/paravirt_types.h |  3 ++-
>  arch/x86/kernel/paravirt.c            |  4 ++-
>  arch/x86/mm/mem_encrypt_identity.c    |  1 +
>  7 files changed, 44 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2792879d398e..6b4b682af468 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -783,8 +783,12 @@ config PARAVIRT
>  	  over full virtualization.  However, when run without a hypervisor
>  	  the kernel is theoretically slower and slightly larger.
>  
> +config PARAVIRT_XL
> +	bool
> +
>  config PARAVIRT_XXL
>  	bool
> +	select PARAVIRT_XL
>  
>  config PARAVIRT_DEBUG
>  	bool "paravirt-ops debugging"
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index 901ea5ebec22..4b84abe43765 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -9,6 +9,7 @@
>   * paravirt and debugging variants are added.)
>   */
>  #undef CONFIG_PARAVIRT
> +#undef CONFIG_PARAVIRT_XL
>  #undef CONFIG_PARAVIRT_XXL

So what happens if someone else needs even less pv and defines
CONFIG_PARAVIRT_L. Or _M? Or _S?

Are we going to teleport into a clothing store each time we look at
paravirt now? :)

So before this goes out of hand let's define explicitly, pls, what
XXL means and XL. And rename them. They could be called PARAVIRT_FULL
and PARAVIRT_HLT as apparently that thing is exposing only the PV ops
related to HLT.

Or something to that effect.

Dunno, maybe Jürgen has a better idea, leaving in the rest quoted for him.

Thx.

>  #undef CONFIG_PARAVIRT_SPINLOCKS
>  #undef CONFIG_KASAN
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index 144d70ea4393..1688841893d7 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -59,27 +59,11 @@ static inline __cpuidle void native_halt(void)
>  
>  #endif
>  
> -#ifdef CONFIG_PARAVIRT_XXL
> +#ifdef CONFIG_PARAVIRT_XL
>  #include <asm/paravirt.h>
>  #else
>  #ifndef __ASSEMBLY__
>  #include <linux/types.h>
> -
> -static __always_inline unsigned long arch_local_save_flags(void)
> -{
> -	return native_save_fl();
> -}
> -
> -static __always_inline void arch_local_irq_disable(void)
> -{
> -	native_irq_disable();
> -}
> -
> -static __always_inline void arch_local_irq_enable(void)
> -{
> -	native_irq_enable();
> -}
> -
>  /*
>   * Used in the idle loop; sti takes one instruction cycle
>   * to complete:
> @@ -97,6 +81,26 @@ static inline __cpuidle void halt(void)
>  {
>  	native_halt();
>  }
> +#endif /* !__ASSEMBLY__ */
> +#endif /* CONFIG_PARAVIRT_XL */
> +
> +#ifndef CONFIG_PARAVIRT_XXL
> +#ifndef __ASSEMBLY__
> +
> +static __always_inline unsigned long arch_local_save_flags(void)
> +{
> +	return native_save_fl();
> +}
> +
> +static __always_inline void arch_local_irq_disable(void)
> +{
> +	native_irq_disable();
> +}
> +
> +static __always_inline void arch_local_irq_enable(void)
> +{
> +	native_irq_enable();
> +}
>  
>  /*
>   * For spinlocks, etc:
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index 4abf110e2243..2dbb6c9c7e98 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -84,6 +84,18 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
>  	PVOP_VCALL1(mmu.exit_mmap, mm);
>  }
>  
> +#ifdef CONFIG_PARAVIRT_XL
> +static inline void arch_safe_halt(void)
> +{
> +	PVOP_VCALL0(irq.safe_halt);
> +}
> +
> +static inline void halt(void)
> +{
> +	PVOP_VCALL0(irq.halt);
> +}
> +#endif
> +
>  #ifdef CONFIG_PARAVIRT_XXL
>  static inline void load_sp0(unsigned long sp0)
>  {
> @@ -145,16 +157,6 @@ static inline void __write_cr4(unsigned long x)
>  	PVOP_VCALL1(cpu.write_cr4, x);
>  }
>  
> -static inline void arch_safe_halt(void)
> -{
> -	PVOP_VCALL0(irq.safe_halt);
> -}
> -
> -static inline void halt(void)
> -{
> -	PVOP_VCALL0(irq.halt);
> -}
> -
>  static inline void wbinvd(void)
>  {
>  	PVOP_VCALL0(cpu.wbinvd);
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index de87087d3bde..5261fba47ba5 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -177,7 +177,8 @@ struct pv_irq_ops {
>  	struct paravirt_callee_save save_fl;
>  	struct paravirt_callee_save irq_disable;
>  	struct paravirt_callee_save irq_enable;
> -
> +#endif
> +#ifdef CONFIG_PARAVIRT_XL
>  	void (*safe_halt)(void);
>  	void (*halt)(void);
>  #endif
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index c60222ab8ab9..d6d0b363fe70 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -322,9 +322,11 @@ struct paravirt_patch_template pv_ops = {
>  	.irq.save_fl		= __PV_IS_CALLEE_SAVE(native_save_fl),
>  	.irq.irq_disable	= __PV_IS_CALLEE_SAVE(native_irq_disable),
>  	.irq.irq_enable		= __PV_IS_CALLEE_SAVE(native_irq_enable),
> +#endif /* CONFIG_PARAVIRT_XXL */
> +#ifdef CONFIG_PARAVIRT_XL
>  	.irq.safe_halt		= native_safe_halt,
>  	.irq.halt		= native_halt,
> -#endif /* CONFIG_PARAVIRT_XXL */
> +#endif /* CONFIG_PARAVIRT_XL */
>  
>  	/* Mmu ops. */
>  	.mmu.flush_tlb_user	= native_flush_tlb_local,
> diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
> index 6c5eb6f3f14f..20d0cb116557 100644
> --- a/arch/x86/mm/mem_encrypt_identity.c
> +++ b/arch/x86/mm/mem_encrypt_identity.c
> @@ -24,6 +24,7 @@
>   * be extended when new paravirt and debugging variants are added.)
>   */
>  #undef CONFIG_PARAVIRT
> +#undef CONFIG_PARAVIRT_XL
>  #undef CONFIG_PARAVIRT_XXL
>  #undef CONFIG_PARAVIRT_SPINLOCKS
>  
> -- 
> 2.25.1
> 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-27 14:29           ` Dave Hansen
@ 2021-04-27 19:18             ` Kuppuswamy, Sathyanarayanan
  2021-04-27 19:20               ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-04-27 19:18 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

Hi Dave,

On 4/27/21 7:29 AM, Dave Hansen wrote:
>> Do we need to rename the helper functions ?
>>
>> tdvmcall(), tdvmcall_out_r11()
> Yes.
> 
>> Also what about output structs?
>>
>> struct tdcall_output
>> struct tdvmcall_output
> Yes, they need sane, straightforward names which are not confusing too.
> 

The following is the rename diff. Please let me know if you agree with
the names used.

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 6c3c71bb57a0..95a6a6c6061a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h

-struct tdcall_output {
+struct tdx_module_output {
         u64 rcx;
         u64 rdx;
         u64 r8;
@@ -19,7 +19,7 @@ struct tdcall_output {
         u64 r11;
  };

-struct tdvmcall_output {
+struct tdx_hypercall_output {
         u64 r11;
         u64 r12;
         u64 r13;
@@ -33,12 +33,12 @@ bool is_tdx_guest(void);
  void __init tdx_early_init(void);

  /* Helper function used to communicate with the TDX module */
-u64 __tdcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-            struct tdcall_output *out);
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+                     struct tdx_module_output *out);

  /* Helper function used to request services from VMM */
-u64 __tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
-              struct tdvmcall_output *out);
+u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+                   struct tdx_hypercall_output *out);

--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -8,11 +8,11 @@
  /*
   * Wrapper for use case that checks for error code and print warning message.
   */
-static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+static inline u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
  {
         u64 err;

-       err = __tdvmcall(fn, r12, r13, r14, r15, NULL);
+       err = __tdx_hypercall(fn, r12, r13, r14, r15, NULL);

         if (err)
                 pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
@@ -24,13 +24,14 @@ static inline u64 tdvmcall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
  /*
   * Wrapper for the semi-common case where we need single output value (R11).
   */
-static inline u64 tdvmcall_out_r11(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+static inline u64 tdx_hypercall_out_r11(u64 fn, u64 r12, u64 r13,
+                                       u64 r14, u64 r15)


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions
  2021-04-27 19:18             ` Kuppuswamy, Sathyanarayanan
@ 2021-04-27 19:20               ` Dave Hansen
  2021-04-28 17:42                 ` [PATCH v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() " Kuppuswamy Sathyanarayanan
  2021-05-19  5:58                 ` [RFC v2-fix-v1 " Kuppuswamy Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-04-27 19:20 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/27/21 12:18 PM, Kuppuswamy, Sathyanarayanan wrote:
> Following is the rename diff. Please let me know if you agree with the
> names used.

Look fine at a glance, but the real key is how they look when they get used.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* [PATCH v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-04-27 19:20               ` Dave Hansen
@ 2021-04-28 17:42                 ` Kuppuswamy Sathyanarayanan
  2021-05-19  5:58                 ` [RFC v2-fix-v1 " Kuppuswamy Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-04-28 17:42 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Kuppuswamy Sathyanarayanan

Guests communicate with VMMs using hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs,
like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
expose guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called "tdcall".

In a TDX-based VM, since the VMM is an untrusted entity, an intermediary
layer (the TDX module) exists between host and guest to facilitate
secure communication. The guest uses the "tdcall" instruction to
request services from the TDX module, and a variant of "tdcall"
(with specific arguments as defined by the GHCI) to request services
from the VMM via the TDX module.

Implement common helper functions to communicate with the TDX module
and the VMM (using the TDCALL instruction).

__tdx_hypercall()    - requests services from the VMM.
__tdx_module_call()  - communicates with the TDX module.

Also define two additional wrappers, tdx_hypercall() and
tdx_hypercall_out_r11() to cover common use cases of
__tdx_hypercall() function. Since each use case of
__tdx_module_call() is different, we don't need such wrappers for it.

Implement __tdx_module_call() and __tdx_hypercall() helper functions
in assembly.

The rationale for choosing assembly over inline assembly is:

1. Since the number of lines of instructions (with comments) in the
__tdx_hypercall() implementation is over 70, using inline assembly
to implement it would make it hard to read.

2. Also, since many registers (R8-R15, R[A-D]X) are used in the
TDCALL operation, if all these registers are included in inline
assembly constraints, some of the older compilers may not
be able to meet this requirement.

Also, just like syscalls, not all TDVMCALL/TDCALL use cases need to
use the same set of argument registers. The implementation here picks
the current worst-case scenario for TDCALL (4 registers). For TDCALLs
with fewer than 4 arguments, there will end up being a few superfluous
(cheap) instructions. But this approach maximizes code reuse. The
same argument applies to the __tdx_hypercall() function as well.

The current implementation of __tdx_hypercall() handles errors (ud2 on
failure) in the assembly function instead of in a C wrapper function.
The reason behind this choice is that, when adding support for in/out
instructions (refer to the patch titled "x86/tdx: Handle port I/O" in
this series), we use alternative_io() to substitute the in/out
instructions with __tdx_hypercall() calls. Using C wrappers is not
trivial in this case because the input parameters would be in the wrong
registers, and it is tricky to include proper buffer code to make this
happen.

For the registers used by the TDCALL instruction, please check the TDX
GHCI specification, sections 2.4 and 3.

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Originally-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Hi Dave,

This version includes all the fixes you suggested. Please let me know
your comments.

Changes since v1:
 * Renamed __tdcall()/__tdvmcall() to
   __tdx_module_call()/__tdx_hypercall().
 * Used BIT() to derive TDVMCALL_EXPOSE_REGS_MASK.
 * Removed unnecessary code in __tdcall() function.
 * Fixed comments as per Dave's review.

 arch/x86/include/asm/tdx.h    |  26 ++++
 arch/x86/kernel/Makefile      |   2 +-
 arch/x86/kernel/asm-offsets.c |  22 ++++
 arch/x86/kernel/tdcall.S      | 215 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c         |  39 ++++++
 5 files changed, 303 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..95a6a6c6061a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,38 @@
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
+#include <linux/types.h>
+
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+struct tdx_hypercall_output {
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
 void __init tdx_early_init(void);
 
+/* Helper function used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		      struct tdx_module_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+		    struct tdx_hypercall_output *out);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..e6b3bb983992 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
 #include <xen/interface/xen.h>
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
 #ifdef CONFIG_X86_32
 # include "asm-offsets_32.c"
 #else
@@ -75,6 +79,24 @@ static void __used common(void)
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+	BLANK();
+	/* Offsets for fields in struct tdx_module_output */
+	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
+	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
+	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+	/* Offsets for fields in struct tdx_hypercall_output */
+	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
+	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
+	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
+	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
+	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#endif
+
 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
 	OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..7e14b4a2312e
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,215 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+
+#define TDG_R10		BIT(10)
+#define TDG_R11		BIT(11)
+#define TDG_R12		BIT(12)
+#define TDG_R13		BIT(13)
+#define TDG_R14		BIT(14)
+#define TDG_R15		BIT(15)
+
+/*
+ * Expose registers R10-R15 to the VMM. This mask is passed via the
+ * RCX register to the TDX module, which uses it to identify the
+ * list of registers exposed to the VMM. Each bit in this mask
+ * represents a register ID. The bit field details can be found
+ * in the TDX GHCI specification.
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK	( TDG_R10 | TDG_R11 | \
+					  TDG_R12 | TDG_R13 | \
+					  TDG_R14 | TDG_R15 )
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdx_module_call()  - Helper function used by TDX guests to request
+ * services from the TDX module (does not include VMM services).
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with the
+ * TDX module.  And if the "tdcall" operation is successful and a valid
+ * "struct tdx_module_output" pointer is available (in "out" argument),
+ * output from the TDX module is saved to the memory specified in the
+ * "out" pointer. Also the status of the "tdcall" operation is returned
+ * back to the user as a function return value.
+ *
+ * @fn  (RDI)		- TDCALL Leaf ID,    moved to RAX
+ * @rcx (RSI)		- Input parameter 1, moved to RCX
+ * @rdx (RDX)		- Input parameter 2, moved to RDX
+ * @r8  (RCX)		- Input parameter 3, moved to R8
+ * @r9  (R8)		- Input parameter 4, moved to R9
+ *
+ * @out (R9)		- struct tdx_module_output pointer
+ *			  stored temporarily in R12 (not
+ *			  shared with the TDX module)
+ *
+ * Return status of tdcall via RAX.
+ *
+ * NOTE: This function should not be used for TDX hypercall
+ *       use cases.
+ */
+SYM_FUNC_START(__tdx_module_call)
+	FRAME_BEGIN
+
+	/*
+	 * R12 will be used as temporary storage for
+	 * struct tdx_module_output pointer. You can
+	 * find struct tdx_module_output details in
+	 * arch/x86/include/asm/tdx.h. Also note that
+	 * registers R12-R15 are not used by TDCALL
+	 * services supported by this helper function.
+	 */
+	push %r12	/* Callee saved, so preserve it */
+	mov %r9,  %r12 	/* Move output pointer to R12 */
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	mov %rdi, %rax	/* Move TDCALL Leaf ID to RAX */
+	mov %r8,  %r9	/* Move input 4 to R9 */
+	mov %rcx, %r8	/* Move input 3 to R8 */
+	mov %rsi, %rcx	/* Move input 1 to RCX */
+	/* Leave input param 2 in RDX */
+
+	tdcall
+
+	/* Check for TDCALL success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check for TDCALL output struct != NULL */
+	test %r12, %r12
+	jz 1f
+
+	/* Copy TDCALL result registers to output struct: */
+	movq %rcx, TDX_MODULE_rcx(%r12)
+	movq %rdx, TDX_MODULE_rdx(%r12)
+	movq %r8,  TDX_MODULE_r8(%r12)
+	movq %r9,  TDX_MODULE_r9(%r12)
+	movq %r10, TDX_MODULE_r10(%r12)
+	movq %r11, TDX_MODULE_r11(%r12)
+1:
+	pop %r12 /* Restore the state of R12 register */
+
+	FRAME_END
+	ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * do_tdx_hypercall()  - Helper function used by TDX guests to request
+ * services from the VMM. All requests are made via the TDX module
+ * using "TDCALL" instruction.
+ *
+ * This function contains the code common to vendor-specific and
+ * standard TDX hypercalls. The caller of this function has to set
+ * the TDVMCALL type in the R10 register before calling it.
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with VMM
+ * via the TDX module. And if the "tdcall" operation is successful and a
+ * valid "struct tdx_hypercall_output" pointer is available (in "out"
+ * argument), output from the VMM is saved to the memory specified in the
+ * "out" pointer.
+ *
+ * @fn  (RDI)		- TDVMCALL function, moved to R11
+ * @r12 (RSI)		- Input parameter 1, moved to R12
+ * @r13 (RDX)		- Input parameter 2, moved to R13
+ * @r14 (RCX)		- Input parameter 3, moved to R14
+ * @r15 (R8)		- Input parameter 4, moved to R15
+ *
+ * @out (R9)		- struct tdx_hypercall_output pointer
+ *
+ * On successful completion, return TDX hypercall error code.
+ * If the "tdcall" operation fails, panic.
+ *
+ */
+SYM_CODE_START_LOCAL(do_tdx_hypercall)
+	FRAME_BEGIN
+
+	/* Save non-volatile GPRs that are exposed to the VMM. */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/* Leave hypercall output pointer in R9, it's not clobbered by VMM */
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	xor %eax, %eax /* Move TDCALL leaf ID (TDVMCALL (0)) to RAX */
+	mov %rdi, %r11 /* Move TDVMCALL function id to R11 */
+	mov %rsi, %r12 /* Move input 1 to R12 */
+	mov %rdx, %r13 /* Move input 2 to R13 */
+	mov %rcx, %r14 /* Move input 3 to R14 */
+	mov %r8,  %r15 /* Move input 4 to R15 */
+	/* Caller of do_tdx_hypercall() will set TDVMCALL type in R10 */
+
+	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+	tdcall
+
+	/*
+	 * Check for TDCALL success: 0 - Successful, otherwise failed.
+	 * If failed, there is an issue with TDX Module which is fatal
+	 * for the guest. So panic. Also note that RAX is controlled
+	 * only by the TDX module and not exposed to VMM.
+	 */
+	test %rax, %rax
+	jnz 2f
+
+	/* Move hypercall error code to RAX to return to user */
+	mov %r10, %rax
+
+	/* Check for hypercall success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check for hypercall output struct != NULL */
+	test %r9, %r9
+	jz 1f
+
+	/* Copy hypercall result registers to output struct: */
+	movq %r11, TDX_HYPERCALL_r11(%r9)
+	movq %r12, TDX_HYPERCALL_r12(%r9)
+	movq %r13, TDX_HYPERCALL_r13(%r9)
+	movq %r14, TDX_HYPERCALL_r14(%r9)
+	movq %r15, TDX_HYPERCALL_r15(%r9)
+1:
+	/*
+	 * Zero out registers exposed to the VMM to avoid
+	 * speculative execution with VMM-controlled values.
+	 */
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+	xor %r12d, %r12d
+	xor %r13d, %r13d
+	xor %r14d, %r14d
+	xor %r15d, %r15d
+
+	/* Restore non-volatile GPRs that are exposed to the VMM. */
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	FRAME_END
+	ret
+2:
+	ud2
+SYM_CODE_END(do_tdx_hypercall)
+
+/* Helper function for standard type of TDVMCALL */
+SYM_FUNC_START(__tdx_hypercall)
+	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+	xor %r10, %r10
+	call do_tdx_hypercall
+	retq
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 6a7193fead08..cbfefc42641e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -1,8 +1,47 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (C) 2020 Intel Corporation */
 
+#define pr_fmt(fmt) "TDX: " fmt
+
 #include <asm/tdx.h>
 
+/*
+ * Wrapper for simple hypercalls that only return a success/error code.
+ */
+static inline u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	u64 err;
+
+	err = __tdx_hypercall(fn, r12, r13, r14, r15, NULL);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return err;
+}
+
+/*
+ * Wrapper for the semi-common case where we need a single output
+ * value (R11). Callers of this function do not care about the
+ * hypercall error code (mainly for IN or MMIO use cases).
+ */
+static inline u64 tdx_hypercall_out_r11(u64 fn, u64 r12, u64 r13,
+					u64 r14, u64 r15)
+{
+
+	struct tdx_hypercall_output out = {0};
+	u64 err;
+
+	err = __tdx_hypercall(fn, r12, r13, r14, r15, &out);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return out.r11;
+}
+
 static inline bool cpuid_has_tdx_guest(void)
 {
 	u32 eax, signature[3];
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 00/32] Add TDX Guest Support
  2021-04-26 18:01 [RFC v2 00/32] Add TDX Guest Support Kuppuswamy Sathyanarayanan
                   ` (31 preceding siblings ...)
  2021-04-26 18:01 ` [RFC v2 32/32] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kuppuswamy Sathyanarayanan
@ 2021-05-03 23:21 ` Kuppuswamy, Sathyanarayanan
  32 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-03 23:21 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

Hi Peter/Andy,

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> Hi All,

Just a gentle ping. Please let me know your comments on this patch set.
I hope it addressed concerns raised by you in RFC v1.

> 
> NOTE: This series is not ready for wide public review. It is being
> specifically posted so that Peter Z and other experts on the entry
> code can look for problems with the new exception handler (#VE).
> That's also why x86@ is not being spammed.
> 
> Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
> hosts and some physical attacks. This series adds the bare-minimum
> support to run a TDX guest. The host-side support will be submitted
> separately. Also support for advanced TD guest features like attestation
> or debug-mode will be submitted separately. Also, at this point it is not
> secure with some known holes in drivers, and also hasn’t been fully audited
> and fuzzed yet.
> 
> TDX has a lot of similarities to SEV. It enhances confidentiality and
> of guest memory and state (like registers) and includes a new exception
> (#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
> yet), TDX limits the host's ability to effect changes in the guest
> physical address space.
> 
> In contrast to the SEV code in the kernel, TDX guest memory is integrity
> protected and isolated; the host is prevented from accessing guest
> memory (even ciphertext).
> 
> The TDX architecture also includes a new CPU mode called
> Secure-Arbitration Mode (SEAM). The software (TDX module) running in this
> mode arbitrates interactions between host and guest and implements many of
> the guarantees of the TDX architecture.
> 
> Some of the key differences between TD and regular VM is,
> 
> 1. Multi CPU bring-up is done using the ACPI MADT wake-up table.
> 2. A new #VE exception handler is added. The TDX module injects #VE exception
>     to the guest TD in cases of instructions that need to be emulated, disallowed
>     MSR accesses, subset of CPUID leaves, etc.
> 3. By default memory is marked as private, and TD will selectively share it with
>     VMM based on need.
> 4. Remote attestation is supported to enable a third party (either the owner of
>     the workload or a user of the services provided by the workload) to establish
>     that the workload is running on an Intel-TDX-enabled platform located within a
>     TD prior to providing that workload data.
> 
> You can find TDX related documents in the following link.
> 
> https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html
> 
> Changes since v1:
>   * Implemented tdcall() and tdvmcall() helper functions in assembly and renamed
>     them as __tdcall() and __tdvmcall().
>   * Added do_general_protection() helper function to re-use protection
>     code between #GP exception and TDX #VE exception handlers.
>   * Addressed syscall gap issue in #VE handler support (for details check
>     the commit log in "x86/traps: Add #VE support for TDX guest").
>   * Modified patch titled "x86/tdx: Handle port I/O" to re-use common
>     tdvmcall() helper function.
>   * Added error handling support to MADT CPU wakeup code.
>   * Introduced enum tdx_map_type to identify SHARED vs PRIVATE memory type.
>   * Enabled shared memory in IOAPIC driver.
>   * Added BINUTILS version info for TDCALL.
>   * Changed the TDVMCALL vendor id from 0 to "TDX.KVM".
>   * Replaced WARN() with pr_warn_ratelimited() in __tdvmcall() wrappers.
>   * Fixed commit log and code comments related review comments.
>   * Renamed patch titled # "x86/topology: Disable CPU hotplug support for TDX
>     platforms" to "x86/topology: Disable CPU online/offline control for
>     TDX guest"
>   * Rebased on top of v5.12 kernel.
> 
> 
> Erik Kaneda (1):
>    ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Structure
> 
> Isaku Yamahata (1):
>    x86/tdx: ioapic: Add shared bit for IOAPIC base address
> 
> Kirill A. Shutemov (16):
>    x86/paravirt: Introduce CONFIG_PARAVIRT_XL
>    x86/tdx: Get TD execution environment information via TDINFO
>    x86/traps: Add #VE support for TDX guest
>    x86/tdx: Add HLT support for TDX guest
>    x86/tdx: Wire up KVM hypercalls
>    x86/tdx: Add MSR support for TDX guest
>    x86/tdx: Handle CPUID via #VE
>    x86/io: Allow to override inX() and outX() implementation
>    x86/tdx: Handle port I/O
>    x86/tdx: Handle in-kernel MMIO
>    x86/mm: Move force_dma_unencrypted() to common code
>    x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
>    x86/tdx: Make pages shared in ioremap()
>    x86/tdx: Add helper to do MapGPA TDVMALL
>    x86/tdx: Make DMA pages shared
>    x86/kvm: Use bounce buffers for TD guest
> 
> Kuppuswamy Sathyanarayanan (10):
>    x86/tdx: Introduce INTEL_TDX_GUEST config option
>    x86/cpufeatures: Add TDX Guest CPU feature
>    x86/x86: Add is_tdx_guest() interface
>    x86/tdx: Add __tdcall() and __tdvmcall() helper functions
>    x86/traps: Add do_general_protection() helper function
>    x86/tdx: Handle MWAIT, MONITOR and WBINVD
>    ACPICA: ACPI 6.4: MADT: add Multiprocessor Wakeup Mailbox Structure
>    ACPI/table: Print MADT Wake table information
>    x86/acpi, x86/boot: Add multiprocessor wake-up support
>    x86/topology: Disable CPU online/offline control for TDX guest
> 
> Sean Christopherson (4):
>    x86/boot: Add a trampoline for APs booting in 64-bit mode
>    x86/boot: Avoid #VE during compressed boot for TDX platforms
>    x86/boot: Avoid unnecessary #VE during boot process
>    x86/tdx: Forcefully disable legacy PIC for TDX guests
> 
>   arch/x86/Kconfig                         |  28 +-
>   arch/x86/boot/compressed/Makefile        |   2 +
>   arch/x86/boot/compressed/head_64.S       |  10 +-
>   arch/x86/boot/compressed/misc.h          |   1 +
>   arch/x86/boot/compressed/pgtable.h       |   2 +-
>   arch/x86/boot/compressed/tdcall.S        |   9 +
>   arch/x86/boot/compressed/tdx.c           |  32 ++
>   arch/x86/include/asm/apic.h              |   3 +
>   arch/x86/include/asm/cpufeatures.h       |   1 +
>   arch/x86/include/asm/idtentry.h          |   4 +
>   arch/x86/include/asm/io.h                |  24 +-
>   arch/x86/include/asm/irqflags.h          |  38 +-
>   arch/x86/include/asm/kvm_para.h          |  21 +
>   arch/x86/include/asm/paravirt.h          |  22 +-
>   arch/x86/include/asm/paravirt_types.h    |   3 +-
>   arch/x86/include/asm/pgtable.h           |   3 +
>   arch/x86/include/asm/realmode.h          |   1 +
>   arch/x86/include/asm/tdx.h               | 176 +++++++++
>   arch/x86/kernel/Makefile                 |   1 +
>   arch/x86/kernel/acpi/boot.c              |  79 ++++
>   arch/x86/kernel/apic/apic.c              |   8 +
>   arch/x86/kernel/apic/io_apic.c           |  12 +-
>   arch/x86/kernel/asm-offsets.c            |  22 ++
>   arch/x86/kernel/head64.c                 |   3 +
>   arch/x86/kernel/head_64.S                |  13 +-
>   arch/x86/kernel/idt.c                    |   6 +
>   arch/x86/kernel/paravirt.c               |   4 +-
>   arch/x86/kernel/pci-swiotlb.c            |   2 +-
>   arch/x86/kernel/smpboot.c                |   5 +
>   arch/x86/kernel/tdcall.S                 | 361 +++++++++++++++++
>   arch/x86/kernel/tdx-kvm.c                |  45 +++
>   arch/x86/kernel/tdx.c                    | 480 +++++++++++++++++++++++
>   arch/x86/kernel/topology.c               |   3 +-
>   arch/x86/kernel/traps.c                  |  81 ++--
>   arch/x86/mm/Makefile                     |   2 +
>   arch/x86/mm/ioremap.c                    |   8 +-
>   arch/x86/mm/mem_encrypt.c                |  75 ----
>   arch/x86/mm/mem_encrypt_common.c         |  85 ++++
>   arch/x86/mm/mem_encrypt_identity.c       |   1 +
>   arch/x86/mm/pat/set_memory.c             |  48 ++-
>   arch/x86/realmode/rm/header.S            |   1 +
>   arch/x86/realmode/rm/trampoline_64.S     |  49 ++-
>   arch/x86/realmode/rm/trampoline_common.S |   5 +-
>   drivers/acpi/tables.c                    |  11 +
>   include/acpi/actbl2.h                    |  26 +-
>   45 files changed, 1654 insertions(+), 162 deletions(-)
>   create mode 100644 arch/x86/boot/compressed/tdcall.S
>   create mode 100644 arch/x86/boot/compressed/tdx.c
>   create mode 100644 arch/x86/include/asm/tdx.h
>   create mode 100644 arch/x86/kernel/tdcall.S
>   create mode 100644 arch/x86/kernel/tdx-kvm.c
>   create mode 100644 arch/x86/kernel/tdx.c
>   create mode 100644 arch/x86/mm/mem_encrypt_common.c
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-04-27 17:31   ` Borislav Petkov
@ 2021-05-06 14:59     ` Kirill A. Shutemov
  2021-05-10  8:07     ` Juergen Gross
  1 sibling, 0 replies; 381+ messages in thread
From: Kirill A. Shutemov @ 2021-05-06 14:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Dan Williams, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel, Jürgen Gross

[-- Attachment #1: Type: text/plain, Size: 276 bytes --]

On Tue, Apr 27, 2021 at 07:31:09PM +0200, Borislav Petkov wrote:
> Or something to that effect.

See the couple of attached patches. Does it look along the lines you wanted?

The first one renames PARAVIRT_XXL and the second one introduces
PARAVIRT_HLT.

-- 
 Kirill A. Shutemov

[-- Attachment #2: 0001-x86-paravirt-Rename-PARAVIRT_XXL-to-PARAVIRT_FULL.patch --]
[-- Type: text/x-diff, Size: 20397 bytes --]

From 2671a044687da0e0c7105beb3467a270b8863a1b Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Thu, 6 May 2021 17:04:42 +0300
Subject: [PATCH 1/2] x86/paravirt: Rename PARAVIRT_XXL to PARAVIRT_FULL

PARAVIRT_XXL provides a way to hook up a full set of paravirt ops.
Rename it to PARAVIRT_FULL to be more self-descriptive.

It's a preparation for the next patch.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                               |  2 +-
 arch/x86/boot/compressed/misc.h                |  2 +-
 arch/x86/entry/vdso/vdso32/vclock_gettime.c    |  2 +-
 arch/x86/include/asm/debugreg.h                |  2 +-
 arch/x86/include/asm/desc.h                    |  4 ++--
 arch/x86/include/asm/fixmap.h                  |  4 ++--
 arch/x86/include/asm/io_bitmap.h               |  2 +-
 arch/x86/include/asm/irqflags.h                |  4 ++--
 arch/x86/include/asm/mmu_context.h             |  4 ++--
 arch/x86/include/asm/msr.h                     |  4 ++--
 arch/x86/include/asm/paravirt.h                | 12 ++++++------
 arch/x86/include/asm/paravirt_types.h          | 10 +++++-----
 arch/x86/include/asm/pgalloc.h                 |  2 +-
 arch/x86/include/asm/pgtable.h                 |  6 +++---
 arch/x86/include/asm/processor.h               |  4 ++--
 arch/x86/include/asm/ptrace.h                  |  2 +-
 arch/x86/include/asm/required-features.h       |  2 +-
 arch/x86/include/asm/special_insns.h           |  4 ++--
 arch/x86/kernel/asm-offsets.c                  |  2 +-
 arch/x86/kernel/asm-offsets_64.c               |  2 +-
 arch/x86/kernel/paravirt.c                     | 18 +++++++++---------
 arch/x86/kernel/paravirt_patch.c               |  6 +++---
 arch/x86/mm/mem_encrypt_identity.c             |  2 +-
 arch/x86/xen/Kconfig                           |  2 +-
 tools/arch/x86/include/asm/required-features.h |  2 +-
 25 files changed, 53 insertions(+), 53 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..568b96e20d59 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -783,7 +783,7 @@ config PARAVIRT
 	  over full virtualization.  However, when run without a hypervisor
 	  the kernel is theoretically slower and slightly larger.
 
-config PARAVIRT_XXL
+config PARAVIRT_FULL
 	bool
 
 config PARAVIRT_DEBUG
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 901ea5ebec22..0e5713c1cb86 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -9,7 +9,7 @@
  * paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
-#undef CONFIG_PARAVIRT_XXL
+#undef CONFIG_PARAVIRT_FULL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 #undef CONFIG_KASAN
 #undef CONFIG_KASAN_GENERIC
diff --git a/arch/x86/entry/vdso/vdso32/vclock_gettime.c b/arch/x86/entry/vdso/vdso32/vclock_gettime.c
index 283ed9d00426..6f543b40b1f4 100644
--- a/arch/x86/entry/vdso/vdso32/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vdso32/vclock_gettime.c
@@ -14,7 +14,7 @@
 #undef CONFIG_ILLEGAL_POINTER_VALUE
 #undef CONFIG_SPARSEMEM_VMEMMAP
 #undef CONFIG_NR_CPUS
-#undef CONFIG_PARAVIRT_XXL
+#undef CONFIG_PARAVIRT_FULL
 
 #define CONFIG_X86_32 1
 #define CONFIG_PGTABLE_LEVELS 2
diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h
index cfdf307ddc01..c4c9b9cbda55 100644
--- a/arch/x86/include/asm/debugreg.h
+++ b/arch/x86/include/asm/debugreg.h
@@ -8,7 +8,7 @@
 
 DECLARE_PER_CPU(unsigned long, cpu_dr7);
 
-#ifndef CONFIG_PARAVIRT_XXL
+#ifndef CONFIG_PARAVIRT_FULL
 /*
  * These special macros can be used to get or set a debugging register
  */
diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 476082a83d1c..51b77118307b 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -103,7 +103,7 @@ static inline int desc_empty(const void *ptr)
 	return !(desc[0] | desc[1]);
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 #define load_TR_desc()				native_load_tr_desc()
@@ -129,7 +129,7 @@ static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries)
 static inline void paravirt_free_ldt(struct desc_struct *ldt, unsigned entries)
 {
 }
-#endif	/* CONFIG_PARAVIRT_XXL */
+#endif	/* CONFIG_PARAVIRT_FULL */
 
 #define store_ldt(ldt) asm("sldt %0" : "=m"(ldt))
 
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index d0dcefb5cc59..a0a4db7b255e 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -105,7 +105,7 @@ enum fixed_addresses {
 	FIX_PCIE_MCFG,
 #endif
 #endif
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	FIX_PARAVIRT_BOOTMAP,
 #endif
 
@@ -160,7 +160,7 @@ void __native_set_fixmap(enum fixed_addresses idx, pte_t pte);
 void native_set_fixmap(unsigned /* enum fixed_addresses */ idx,
 		       phys_addr_t phys, pgprot_t flags);
 
-#ifndef CONFIG_PARAVIRT_XXL
+#ifndef CONFIG_PARAVIRT_FULL
 static inline void __set_fixmap(enum fixed_addresses idx,
 				phys_addr_t phys, pgprot_t flags)
 {
diff --git a/arch/x86/include/asm/io_bitmap.h b/arch/x86/include/asm/io_bitmap.h
index 7f080f5c7def..2c20cd0669d3 100644
--- a/arch/x86/include/asm/io_bitmap.h
+++ b/arch/x86/include/asm/io_bitmap.h
@@ -36,7 +36,7 @@ static inline void native_tss_invalidate_io_bitmap(void)
 
 void native_tss_update_io_bitmap(void);
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 #define tss_update_io_bitmap native_tss_update_io_bitmap
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 144d70ea4393..a4d7dbc2b034 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -59,7 +59,7 @@ static inline __cpuidle void native_halt(void)
 
 #endif
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 #ifndef __ASSEMBLY__
@@ -124,7 +124,7 @@ static __always_inline unsigned long arch_local_irq_save(void)
 #endif
 
 #endif /* __ASSEMBLY__ */
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_FULL */
 
 #ifndef __ASSEMBLY__
 static __always_inline int arch_irqs_disabled_flags(unsigned long flags)
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 27516046117a..98949a97daf3 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -15,12 +15,12 @@
 
 extern atomic64_t last_mm_ctx_id;
 
-#ifndef CONFIG_PARAVIRT_XXL
+#ifndef CONFIG_PARAVIRT_FULL
 static inline void paravirt_activate_mm(struct mm_struct *prev,
 					struct mm_struct *next)
 {
 }
-#endif	/* !CONFIG_PARAVIRT_XXL */
+#endif	/* !CONFIG_PARAVIRT_FULL */
 
 #ifdef CONFIG_PERF_EVENTS
 DECLARE_STATIC_KEY_FALSE(rdpmc_never_available_key);
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index e16cccdd0420..7d1c97093780 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -251,7 +251,7 @@ static inline unsigned long long native_read_pmc(int counter)
 	return EAX_EDX_VAL(val, low, high);
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 #include <linux/errno.h>
@@ -314,7 +314,7 @@ do {							\
 
 #define rdpmcl(counter, val) ((val) = native_read_pmc(counter))
 
-#endif	/* !CONFIG_PARAVIRT_XXL */
+#endif	/* !CONFIG_PARAVIRT_FULL */
 
 /*
  * 64-bit version of wrmsr_safe():
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110e2243..02751519b0d9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,7 +84,7 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 	PVOP_VCALL1(mmu.exit_mmap, mm);
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 static inline void load_sp0(unsigned long sp0)
 {
 	PVOP_VCALL1(cpu.load_sp0, sp0);
@@ -642,7 +642,7 @@ bool __raw_callee_save___native_vcpu_is_preempted(long cpu);
 #define __PV_IS_CALLEE_SAVE(func)			\
 	((struct paravirt_callee_save) { func })
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 static inline notrace unsigned long arch_local_save_flags(void)
 {
 	return PVOP_CALLEE0(unsigned long, irq.save_fl);
@@ -748,7 +748,7 @@ extern void default_banner(void);
 #define PARA_INDIRECT(addr)	*%cs:addr
 #endif
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #define INTERRUPT_RETURN						\
 	PARA_SITE(PARA_PATCH(PV_CPU_iret),				\
 		  ANNOTATE_RETPOLINE_SAFE;				\
@@ -770,7 +770,7 @@ extern void default_banner(void);
 #endif
 
 #ifdef CONFIG_X86_64
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #ifdef CONFIG_DEBUG_ENTRY
 #define SAVE_FLAGS(clobbers)                                        \
 	PARA_SITE(PARA_PATCH(PV_IRQ_save_fl),			    \
@@ -779,7 +779,7 @@ extern void default_banner(void);
 		  call PARA_INDIRECT(pv_ops+PV_IRQ_save_fl);	    \
 		  PV_RESTORE_REGS(clobbers | CLBR_CALLEE_SAVE);)
 #endif
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_FULL */
 #endif	/* CONFIG_X86_64 */
 
 #endif /* __ASSEMBLY__ */
@@ -788,7 +788,7 @@ extern void default_banner(void);
 #endif /* !CONFIG_PARAVIRT */
 
 #ifndef __ASSEMBLY__
-#ifndef CONFIG_PARAVIRT_XXL
+#ifndef CONFIG_PARAVIRT_FULL
 static inline void paravirt_arch_dup_mmap(struct mm_struct *oldmm,
 					  struct mm_struct *mm)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index de87087d3bde..ae3503b2e8a2 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -66,7 +66,7 @@ struct paravirt_callee_save {
 
 /* general info */
 struct pv_info {
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	u16 extra_user_64bit_cs;  /* __USER_CS if none */
 #endif
 
@@ -86,7 +86,7 @@ struct pv_init_ops {
 			  unsigned long addr, unsigned len);
 } __no_randomize_layout;
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 struct pv_lazy_ops {
 	/* Set deferred update mode, used for batching operations. */
 	void (*enter)(void);
@@ -104,7 +104,7 @@ struct pv_cpu_ops {
 	/* hooks for various privileged instructions */
 	void (*io_delay)(void);
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	unsigned long (*get_debugreg)(int regno);
 	void (*set_debugreg)(int regno, unsigned long value);
 
@@ -166,7 +166,7 @@ struct pv_cpu_ops {
 } __no_randomize_layout;
 
 struct pv_irq_ops {
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	/*
 	 * Get/set interrupt state.  save_fl is expected to use X86_EFLAGS_IF;
 	 * all other bits returned from save_fl are undefined.
@@ -196,7 +196,7 @@ struct pv_mmu_ops {
 	/* Hook for intercepting the destruction of an mm_struct. */
 	void (*exit_mmap)(struct mm_struct *mm);
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	struct paravirt_callee_save read_cr2;
 	void (*write_cr2)(unsigned long);
 
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index 62ad61d6fefc..7bd2744b52ba 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -12,7 +12,7 @@
 
 static inline int  __paravirt_pgd_alloc(struct mm_struct *mm) { return 0; }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 #define paravirt_pgd_alloc(mm)	__paravirt_pgd_alloc(mm)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..8c4eecc0444a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -59,9 +59,9 @@ extern struct mm_struct *pgd_page_get_mm(struct page *page);
 
 extern pmdval_t early_pmd_flags;
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
-#else  /* !CONFIG_PARAVIRT_XXL */
+#else  /* !CONFIG_PARAVIRT_FULL */
 #define set_pte(ptep, pte)		native_set_pte(ptep, pte)
 
 #define set_pte_atomic(ptep, pte)					\
@@ -115,7 +115,7 @@ extern pmdval_t early_pmd_flags;
 #define __pte(x)	native_make_pte(x)
 
 #define arch_end_context_switch(prev)	do {} while(0)
-#endif	/* CONFIG_PARAVIRT_XXL */
+#endif	/* CONFIG_PARAVIRT_FULL */
 
 /*
  * The following only work if pte_present() is true.
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f1b9ed5efaa9..47c4eb146a87 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -580,7 +580,7 @@ static inline bool on_thread_stack(void)
 			       current_stack_pointer) < THREAD_SIZE;
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 #define __cpuid			native_cpuid
@@ -590,7 +590,7 @@ static inline void load_sp0(unsigned long sp0)
 	native_load_sp0(sp0);
 }
 
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_FULL */
 
 /* Free all resources held by a thread. */
 extern void release_thread(struct task_struct *);
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 409f661481e1..0f8adc38fc03 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -146,7 +146,7 @@ static inline int v8086_mode(struct pt_regs *regs)
 static inline bool user_64bit_mode(struct pt_regs *regs)
 {
 #ifdef CONFIG_X86_64
-#ifndef CONFIG_PARAVIRT_XXL
+#ifndef CONFIG_PARAVIRT_FULL
 	/*
 	 * On non-paravirt systems, this is the only long mode CPL 3
 	 * selector.  We do not allow long mode selectors in the LDT.
diff --git a/arch/x86/include/asm/required-features.h b/arch/x86/include/asm/required-features.h
index b2d504f11937..e37ef3a4cbd3 100644
--- a/arch/x86/include/asm/required-features.h
+++ b/arch/x86/include/asm/required-features.h
@@ -54,7 +54,7 @@
 #endif
 
 #ifdef CONFIG_X86_64
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 /* Paravirtualized systems may not have PSE or PGE available */
 #define NEED_PSE	0
 #define NEED_PGE	0
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 1d3cbaef4bb7..f26fc9acf4cc 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -148,7 +148,7 @@ static inline unsigned long __read_cr4(void)
 	return native_read_cr4();
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #include <asm/paravirt.h>
 #else
 
@@ -205,7 +205,7 @@ static inline void load_gs_index(unsigned int selector)
 
 #endif
 
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_FULL */
 
 static inline void clflush(volatile void *__p)
 {
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..cc247c723c5e 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -61,7 +61,7 @@ static void __used common(void)
 	OFFSET(IA32_RT_SIGFRAME_sigcontext, rt_sigframe_ia32, uc.uc_mcontext);
 #endif
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	BLANK();
 	OFFSET(PV_IRQ_irq_disable, paravirt_patch_template, irq.irq_disable);
 	OFFSET(PV_IRQ_irq_enable, paravirt_patch_template, irq.irq_enable);
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index b14533af7676..7bc5cb486eca 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -12,7 +12,7 @@
 int main(void)
 {
 #ifdef CONFIG_PARAVIRT
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 #ifdef CONFIG_DEBUG_ENTRY
 	OFFSET(PV_IRQ_save_fl, paravirt_patch_template, irq.save_fl);
 #endif
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c60222ab8ab9..e3a5f0cf9340 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -79,7 +79,7 @@ static unsigned paravirt_patch_call(void *insn_buff, const void *target,
 	return call_len;
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 /* identity function, which can be inlined */
 u64 notrace _paravirt_ident_64(u64 x)
 {
@@ -130,7 +130,7 @@ unsigned paravirt_patch_default(u8 type, void *insn_buff,
 	else if (opfunc == _paravirt_nop)
 		ret = 0;
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	/* identity functions just return their single argument */
 	else if (opfunc == _paravirt_ident_64)
 		ret = paravirt_patch_ident_64(insn_buff, len);
@@ -227,7 +227,7 @@ void paravirt_flush_lazy_mmu(void)
 	preempt_enable();
 }
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 void paravirt_start_context_switch(struct task_struct *prev)
 {
 	BUG_ON(preemptible());
@@ -260,7 +260,7 @@ enum paravirt_lazy_mode paravirt_get_lazy_mode(void)
 
 struct pv_info pv_info = {
 	.name = "bare hardware",
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	.extra_user_64bit_cs = __USER_CS,
 #endif
 };
@@ -279,7 +279,7 @@ struct paravirt_patch_template pv_ops = {
 	/* Cpu ops. */
 	.cpu.io_delay		= native_io_delay,
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	.cpu.cpuid		= native_cpuid,
 	.cpu.get_debugreg	= native_get_debugreg,
 	.cpu.set_debugreg	= native_set_debugreg,
@@ -324,7 +324,7 @@ struct paravirt_patch_template pv_ops = {
 	.irq.irq_enable		= __PV_IS_CALLEE_SAVE(native_irq_enable),
 	.irq.safe_halt		= native_safe_halt,
 	.irq.halt		= native_halt,
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_FULL */
 
 	/* Mmu ops. */
 	.mmu.flush_tlb_user	= native_flush_tlb_local,
@@ -336,7 +336,7 @@ struct paravirt_patch_template pv_ops = {
 
 	.mmu.exit_mmap		= paravirt_nop,
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	.mmu.read_cr2		= __PV_IS_CALLEE_SAVE(native_read_cr2),
 	.mmu.write_cr2		= native_write_cr2,
 	.mmu.read_cr3		= __native_read_cr3,
@@ -393,7 +393,7 @@ struct paravirt_patch_template pv_ops = {
 	},
 
 	.mmu.set_fixmap		= native_set_fixmap,
-#endif /* CONFIG_PARAVIRT_XXL */
+#endif /* CONFIG_PARAVIRT_FULL */
 
 #if defined(CONFIG_PARAVIRT_SPINLOCKS)
 	/* Lock ops. */
@@ -409,7 +409,7 @@ struct paravirt_patch_template pv_ops = {
 #endif
 };
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 /* At this point, native_get/set_debugreg has real function entries */
 NOKPROBE_SYMBOL(native_get_debugreg);
 NOKPROBE_SYMBOL(native_set_debugreg);
diff --git a/arch/x86/kernel/paravirt_patch.c b/arch/x86/kernel/paravirt_patch.c
index abd27ec67397..d100993dfdb3 100644
--- a/arch/x86/kernel/paravirt_patch.c
+++ b/arch/x86/kernel/paravirt_patch.c
@@ -17,7 +17,7 @@
 	case PARAVIRT_PATCH(ops.m):					\
 		return PATCH(data, ops##_##m, insn_buff, len)
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 struct patch_xxl {
 	const unsigned char	irq_irq_disable[1];
 	const unsigned char	irq_irq_enable[1];
@@ -44,7 +44,7 @@ unsigned int paravirt_patch_ident_64(void *insn_buff, unsigned int len)
 {
 	return PATCH(xxl, mov64, insn_buff, len);
 }
-# endif /* CONFIG_PARAVIRT_XXL */
+# endif /* CONFIG_PARAVIRT_FULL */
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 struct patch_lock {
@@ -68,7 +68,7 @@ unsigned int native_patch(u8 type, void *insn_buff, unsigned long addr,
 {
 	switch (type) {
 
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 	PATCH_CASE(irq, save_fl, xxl, insn_buff, len);
 	PATCH_CASE(irq, irq_enable, xxl, insn_buff, len);
 	PATCH_CASE(irq, irq_disable, xxl, insn_buff, len);
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 6c5eb6f3f14f..53fe895cefe2 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -24,7 +24,7 @@
  * be extended when new paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
-#undef CONFIG_PARAVIRT_XXL
+#undef CONFIG_PARAVIRT_FULL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 
 #include <linux/kernel.h>
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index afc1da68b06d..aa96670248e7 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -20,7 +20,7 @@ config XEN_PV
 	default y
 	depends on XEN
 	depends on X86_64
-	select PARAVIRT_XXL
+	select PARAVIRT_FULL
 	select XEN_HAVE_PVMMU
 	select XEN_HAVE_VPMU
 	help
diff --git a/tools/arch/x86/include/asm/required-features.h b/tools/arch/x86/include/asm/required-features.h
index b2d504f11937..e37ef3a4cbd3 100644
--- a/tools/arch/x86/include/asm/required-features.h
+++ b/tools/arch/x86/include/asm/required-features.h
@@ -54,7 +54,7 @@
 #endif
 
 #ifdef CONFIG_X86_64
-#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_PARAVIRT_FULL
 /* Paravirtualized systems may not have PSE or PGE available */
 #define NEED_PSE	0
 #define NEED_PGE	0
-- 
2.26.3


[-- Attachment #3: 0002-x86-paravirt-Introduce-CONFIG_PARAVIRT_HLT.patch --]
[-- Type: text/x-diff, Size: 5575 bytes --]

From 08e03cebb65b48e6b7e5bf0d805762cc661d09f0 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Tue, 19 Nov 2019 14:56:03 +0300
Subject: [PATCH 2/2] x86/paravirt: Introduce CONFIG_PARAVIRT_HLT

CONFIG_PARAVIRT_FULL provides a way to hook up the full set of paravirt
ops, but TDX only needs two of them: halt() and safe_halt().

Split off halt paravirt calls from CONFIG_PARAVIRT_FULL into
a separate config option -- CONFIG_PARAVIRT_HLT.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                      |  4 +++
 arch/x86/boot/compressed/misc.h       |  1 +
 arch/x86/include/asm/irqflags.h       | 38 +++++++++++++++------------
 arch/x86/include/asm/paravirt.h       | 22 +++++++++-------
 arch/x86/include/asm/paravirt_types.h |  3 ++-
 arch/x86/kernel/paravirt.c            |  4 ++-
 arch/x86/mm/mem_encrypt_identity.c    |  1 +
 7 files changed, 44 insertions(+), 29 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 568b96e20d59..830367e36d5a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -783,8 +783,12 @@ config PARAVIRT
 	  over full virtualization.  However, when run without a hypervisor
 	  the kernel is theoretically slower and slightly larger.
 
+config PARAVIRT_HLT
+	bool
+
 config PARAVIRT_FULL
 	bool
+	select PARAVIRT_HLT
 
 config PARAVIRT_DEBUG
 	bool "paravirt-ops debugging"
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 0e5713c1cb86..293f22dbada4 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -9,6 +9,7 @@
  * paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
+#undef CONFIG_PARAVIRT_HLT
 #undef CONFIG_PARAVIRT_FULL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 #undef CONFIG_KASAN
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index a4d7dbc2b034..ae839e74fc34 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -59,27 +59,11 @@ static inline __cpuidle void native_halt(void)
 
 #endif
 
-#ifdef CONFIG_PARAVIRT_FULL
+#ifdef CONFIG_PARAVIRT_HLT
 #include <asm/paravirt.h>
 #else
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
-
-static __always_inline unsigned long arch_local_save_flags(void)
-{
-	return native_save_fl();
-}
-
-static __always_inline void arch_local_irq_disable(void)
-{
-	native_irq_disable();
-}
-
-static __always_inline void arch_local_irq_enable(void)
-{
-	native_irq_enable();
-}
-
 /*
  * Used in the idle loop; sti takes one instruction cycle
  * to complete:
@@ -97,6 +81,26 @@ static inline __cpuidle void halt(void)
 {
 	native_halt();
 }
+#endif /* !__ASSEMBLY__ */
+#endif /* CONFIG_PARAVIRT_HLT */
+
+#ifndef CONFIG_PARAVIRT_FULL
+#ifndef __ASSEMBLY__
+
+static __always_inline unsigned long arch_local_save_flags(void)
+{
+	return native_save_fl();
+}
+
+static __always_inline void arch_local_irq_disable(void)
+{
+	native_irq_disable();
+}
+
+static __always_inline void arch_local_irq_enable(void)
+{
+	native_irq_enable();
+}
 
 /*
  * For spinlocks, etc:
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 02751519b0d9..6bc5c1eab6eb 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,18 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 	PVOP_VCALL1(mmu.exit_mmap, mm);
 }
 
+#ifdef CONFIG_PARAVIRT_HLT
+static inline void arch_safe_halt(void)
+{
+	PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+	PVOP_VCALL0(irq.halt);
+}
+#endif
+
 #ifdef CONFIG_PARAVIRT_FULL
 static inline void load_sp0(unsigned long sp0)
 {
@@ -145,16 +157,6 @@ static inline void __write_cr4(unsigned long x)
 	PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-static inline void arch_safe_halt(void)
-{
-	PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
-	PVOP_VCALL0(irq.halt);
-}
-
 static inline void wbinvd(void)
 {
 	PVOP_VCALL0(cpu.wbinvd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index ae3503b2e8a2..cac32ffd95ca 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -177,7 +177,8 @@ struct pv_irq_ops {
 	struct paravirt_callee_save save_fl;
 	struct paravirt_callee_save irq_disable;
 	struct paravirt_callee_save irq_enable;
-
+#endif
+#ifdef CONFIG_PARAVIRT_HLT
 	void (*safe_halt)(void);
 	void (*halt)(void);
 #endif
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index e3a5f0cf9340..752fc3ab81bf 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -322,9 +322,11 @@ struct paravirt_patch_template pv_ops = {
 	.irq.save_fl		= __PV_IS_CALLEE_SAVE(native_save_fl),
 	.irq.irq_disable	= __PV_IS_CALLEE_SAVE(native_irq_disable),
 	.irq.irq_enable		= __PV_IS_CALLEE_SAVE(native_irq_enable),
+#endif /* CONFIG_PARAVIRT_FULL */
+#ifdef CONFIG_PARAVIRT_HLT
 	.irq.safe_halt		= native_safe_halt,
 	.irq.halt		= native_halt,
-#endif /* CONFIG_PARAVIRT_FULL */
+#endif /* CONFIG_PARAVIRT_HLT */
 
 	/* Mmu ops. */
 	.mmu.flush_tlb_user	= native_flush_tlb_local,
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 53fe895cefe2..7cb9b70edbe7 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -24,6 +24,7 @@
  * be extended when new paravirt and debugging variants are added.)
  */
 #undef CONFIG_PARAVIRT
+#undef CONFIG_PARAVIRT_HLT
 #undef CONFIG_PARAVIRT_FULL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 
-- 
2.26.3



* Re: [RFC v2 07/32] x86/traps: Add do_general_protection() helper function
  2021-04-26 18:01 ` [RFC v2 07/32] x86/traps: Add do_general_protection() helper function Kuppuswamy Sathyanarayanan
@ 2021-05-07 21:20   ` Dave Hansen
  0 siblings, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 21:20 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> TDX guest #VE exception handler treats unsupported exceptions

 ^ The

> as #GP. So to handle the #GP, move the protection fault handler

s/So to/To/

Also, it does not "treat them as #GP".  It handles them in the same way
that a #GP is handled.  There's a difference between literally making
them a #GP and having a similar end result.  This description conflates
them.

> code to out of exc_general_protection() and create new helper
> function for it.

I wouldn't name the functions.  Just say that you want the #GP behavior
from #VE so you need a common helper.

> Also since exception handler is responsible to decide when to

	    ^ an

> turn on/off IRQ, move cond_local_irq_{enable/disable)() calls
> out of do_general_protection().

This paragraph doesn't really say anything meaningful.  Yes, exception
handlers reenable interrupts.  Try to *SAY* something about why they do
this and why you have to move the code around.  Or, just axe it.
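
One way to picture the split is the userspace sketch below.  It is not
kernel code: every toy_* name is made up for illustration, and the
"interrupt state" is just a flag.  It only shows the shape being asked
for: a shared helper that never touches the interrupt state, and entry
points that each decide when interrupts come back on.

```c
#include <stdbool.h>

/* Toy stand-ins for kernel state; none of these are real kernel symbols. */
static bool irqs_enabled;
static int helper_calls;
static bool irqs_seen_by_helper;
static bool ve_info_read_with_irqs_off;

static void toy_irq_enable(void)  { irqs_enabled = true; }
static void toy_irq_disable(void) { irqs_enabled = false; }

/* Shared fault handling: leaves the interrupt state alone. */
static void toy_common_fault(long error_code)
{
	(void)error_code;
	helper_calls++;
	irqs_seen_by_helper = irqs_enabled;
}

/* #GP-style entry: may re-enable interrupts straight away. */
static void toy_gp_entry(long error_code)
{
	toy_irq_enable();
	toy_common_fault(error_code);
	toy_irq_disable();
}

/* #VE-style entry: has work to do before interrupts may come back on. */
static void toy_ve_entry(void)
{
	/* Stand-in for reading the #VE info with interrupts still off. */
	ve_info_read_with_irqs_off = !irqs_enabled;

	toy_irq_enable();
	toy_common_fault(0);
	toy_irq_disable();
}
```

Because the helper does not enable or disable anything itself, each
caller can place its own early work ahead of the enable call.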

> This is a preparatory patch for adding #VE exception handler
> support for TDX guests.
> 
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/kernel/traps.c | 51 ++++++++++++++++++++++-------------------
>  1 file changed, 27 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 651e3e508959..213d4aa8e337 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -527,44 +527,28 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
>  
>  #define GPFSTR "general protection fault"
>  
> -DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> +static void do_general_protection(struct pt_regs *regs, long error_code)
>  {
>  	char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
>  	enum kernel_gp_hint hint = GP_NO_HINT;
> -	struct task_struct *tsk;
> +	struct task_struct *tsk = current;
>  	unsigned long gp_addr;
>  	int ret;
>  
> -	cond_local_irq_enable(regs);
> -
> -	if (static_cpu_has(X86_FEATURE_UMIP)) {
> -		if (user_mode(regs) && fixup_umip_exception(regs))
> -			goto exit;
> -	}
> -
> -	if (v8086_mode(regs)) {
> -		local_irq_enable();
> -		handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
> -		local_irq_disable();
> -		return;
> -	}
> -
> -	tsk = current;
> -
>  	if (user_mode(regs)) {
>  		tsk->thread.error_code = error_code;
>  		tsk->thread.trap_nr = X86_TRAP_GP;
>  
>  		if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
> -			goto exit;
> +			return;
>  
>  		show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
>  		force_sig(SIGSEGV);
> -		goto exit;
> +		return;
>  	}
>  
>  	if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
> -		goto exit;
> +		return;
>  
>  	tsk->thread.error_code = error_code;
>  	tsk->thread.trap_nr = X86_TRAP_GP;
> @@ -576,11 +560,11 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>  	if (!preemptible() &&
>  	    kprobe_running() &&
>  	    kprobe_fault_handler(regs, X86_TRAP_GP))
> -		goto exit;
> +		return;
>  
>  	ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);

So... We're going to send signals based on #VE which use this bit in the
ABI which is documented as:

#define X86_TRAP_GP             13      /* General Protection Fault */

Considering that there is also a:

#define X86_TRAP_VE             20      /* Virtualization Exception */

this seems like a stretch.

Also, isn't there a lot of truly #GP-specific code in there, like
fixup_exception()?  Why do you need to call that for #VE?  How did you
decide what remains in the handler versus what gets separated out?
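
To make the concern concrete, here is a small userspace sketch.  The
toy_* names and the struct are hypothetical; only the two trap numbers
come from the defines quoted above.  It mimics a helper that hard-codes
the #GP trap number and is then reached from a #VE fallback path, as in
the posted patch.

```c
#define TOY_TRAP_GP 13	/* same value as X86_TRAP_GP */
#define TOY_TRAP_VE 20	/* same value as X86_TRAP_VE */

/* Hypothetical stand-in for the per-thread fault bookkeeping. */
struct toy_thread {
	int trap_nr;
	long error_code;
};

/* Mirrors the shape of the helper: the trap number is fixed to #GP. */
static void toy_do_general_protection(struct toy_thread *t, long error_code)
{
	t->error_code = error_code;
	t->trap_nr = TOY_TRAP_GP;
}

/* Fallback path taken when a #VE could not be handled. */
static void toy_ve_fallback(struct toy_thread *t)
{
	toy_do_general_protection(t, 0);
}
```

After toy_ve_fallback() runs, the recorded trap number is 13 even though
the exception that was taken is vector 20, which is what anything
inspecting that field would then see.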

>  	if (ret == NOTIFY_STOP)
> -		goto exit;
> +		return;
>  
>  	if (error_code)
>  		snprintf(desc, sizeof(desc), "segment-related " GPFSTR);
> @@ -601,8 +585,27 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>  		gp_addr = 0;
>  
>  	die_addr(desc, regs, error_code, gp_addr);
> +}
>  
> -exit:
> +DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> +{
> +	cond_local_irq_enable(regs);
> +
> +	if (static_cpu_has(X86_FEATURE_UMIP)) {
> +		if (user_mode(regs) && fixup_umip_exception(regs)) {
> +			cond_local_irq_disable(regs);
> +			return;
> +		}
> +	}
> +
> +	if (v8086_mode(regs)) {
> +		local_irq_enable();
> +		handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
> +		local_irq_disable();
> +		return;
> +	}
> +
> +	do_general_protection(regs, error_code);
>  	cond_local_irq_disable(regs);
>  }



* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-04-26 18:01 ` [RFC v2 08/32] x86/traps: Add #VE support for TDX guest Kuppuswamy Sathyanarayanan
@ 2021-05-07 21:36   ` Dave Hansen
  2021-05-13 19:47     ` Andi Kleen
  2021-05-18  0:09     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
  2021-06-08 17:02   ` [RFC v2 08/32] " Dave Hansen
  1 sibling, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 21:36 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
...
> The #VE cannot be nested before TDGETVEINFO is called, if there is any
> reason for it to nest the TD would shut down. The TDX module guarantees
> that no NMIs (or #MC or similar) can happen in this window. After
> TDGETVEINFO the #VE handler can nest if needed, although we don’t expect
> it to happen normally.

I think this description really needs some work.  Does "The #VE cannot
be nested" mean that "hardware guarantees that #VE will not be
generated", or "the #VE must not be nested"?

What does "the TD would shut down" mean?  I think you mean that instead
of delivering a nested #VE the hardware would actually exit to the host
and TDX would prevent the guest from being reentered.  Right?

> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index 5eb3bdf36a41..41a0732d5f68 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -619,6 +619,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER,	exc_xen_hypervisor_callback);
>  DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER,	exc_xen_unknown_trap);
>  #endif
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +DECLARE_IDTENTRY(X86_TRAP_VE,		exc_virtualization_exception);
> +#endif
> +
>  /* Device interrupts common/spurious */
>  DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);
>  #ifdef CONFIG_X86_LOCAL_APIC
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index c5a870cef0ae..1ca55d8e9963 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -11,6 +11,7 @@
>  #include <linux/types.h>
>  
>  #define TDINFO			1
> +#define TDGETVEINFO		3
>  
>  struct tdcall_output {
>  	u64 rcx;
> @@ -29,6 +30,20 @@ struct tdvmcall_output {
>  	u64 r15;
>  };
>  
> +struct ve_info {
> +	u64 exit_reason;
> +	u64 exit_qual;
> +	u64 gla;
> +	u64 gpa;
> +	u32 instr_len;
> +	u32 instr_info;
> +};

Is this an architectural structure or some software construct?
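
For reference while reading, this is how the posted tdg_get_ve_info()
(further down in this patch) fills the structure from the TDCALL output
registers, rewritten as a standalone userspace sketch.  The toy_* types
are stand-ins and the register struct only lists the fields used here.
It says nothing about whether the layout is architectural.

```c
#include <stdint.h>

/* Stand-in for the subset of struct tdcall_output that is used here. */
struct toy_tdcall_output {
	uint64_t rcx;
	uint64_t rdx;
	uint64_t r8;
	uint64_t r9;
	uint64_t r10;
};

/* Same field layout as the struct ve_info quoted above. */
struct toy_ve_info {
	uint64_t exit_reason;
	uint64_t exit_qual;
	uint64_t gla;
	uint64_t gpa;
	uint32_t instr_len;
	uint32_t instr_info;
};

static void toy_unpack_ve_info(const struct toy_tdcall_output *out,
			       struct toy_ve_info *ve)
{
	ve->exit_reason = out->rcx;
	ve->exit_qual   = out->rdx;
	ve->gla         = out->r8;
	ve->gpa         = out->r9;
	/* r10 carries two 32-bit values: length low, info high. */
	ve->instr_len   = (uint32_t)(out->r10 & UINT32_MAX);
	ve->instr_info  = (uint32_t)(out->r10 >> 32);
}
```

So four of the fields are straight register copies, and the two 32-bit
fields are the halves of a single register.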

> +unsigned long tdg_get_ve_info(struct ve_info *ve);
> +
> +int tdg_handle_virtualization_exception(struct pt_regs *regs,
> +		struct ve_info *ve);
> +
>  /* Common API to check TDX support in decompression and common kernel code. */
>  bool is_tdx_guest(void);
>  
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index ee1a283f8e96..546b6b636c7d 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
>  	 */
>  	INTG(X86_TRAP_PF,		asm_exc_page_fault),
>  #endif
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
> +#endif
>  };
>  
>  /*
> @@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
>  	INTG(X86_TRAP_MF,		asm_exc_coprocessor_error),
>  	INTG(X86_TRAP_AC,		asm_exc_alignment_check),
>  	INTG(X86_TRAP_XF,		asm_exc_simd_coprocessor_error),
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
> +#endif
>  
>  #ifdef CONFIG_X86_32
>  	TSKG(X86_TRAP_DF,		GDT_ENTRY_DOUBLEFAULT_TSS),
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index b63275db1db9..ccfcb07bfb2c 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -82,6 +82,44 @@ static void tdg_get_info(void)
>  	td_info.attributes = out.rdx;
>  }
>  
> +unsigned long tdg_get_ve_info(struct ve_info *ve)
> +{
> +	u64 ret;
> +	struct tdcall_output out = {0};
> +
> +	/*
> +	 * The #VE cannot be nested before TDGETVEINFO is called,
> +	 * if there is any reason for it to nest the TD would shut
> +	 * down. The TDX module guarantees that no NMIs (or #MC or
> +	 * similar) can happen in this window. After TDGETVEINFO
> +	 * the #VE handler can nest if needed, although we don’t
> +	 * expect it to happen normally.
> +	 */

I find that description a bit unsatisfying.  Could we make this a bit
more concrete?  By the way, what about *normal* interrupts?

Maybe we should talk about this in terms of *rules* that folks need to
follow.  Maybe:

	NMIs and machine checks are suppressed.  Before this point any
	#VE is fatal.  After this point, NMIs and additional #VEs are
	permitted.
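
Read as rules, the wording above amounts to a two-state model.  The
sketch below is only that model in userspace C, with made-up toy_*
names.  It restates the proposed comment and is not a description of
what the TDX module actually guarantees.

```c
#include <stdbool.h>

enum toy_ve_outcome {
	TOY_VE_FATAL,		/* a nested #VE here would kill the TD */
	TOY_VE_DELIVERED,	/* a nested #VE here is delivered normally */
};

/* Tracks whether the #VE info has been read for the current #VE. */
struct toy_ve_state {
	bool info_consumed;
};

/* A #VE was just taken: its info has not been read yet. */
static void toy_ve_begin(struct toy_ve_state *s)
{
	s->info_consumed = false;
}

/* Stand-in for the TDGETVEINFO call. */
static void toy_get_ve_info(struct toy_ve_state *s)
{
	s->info_consumed = true;
}

/* Rule: before the info is read any #VE is fatal, afterwards it is not. */
static enum toy_ve_outcome toy_nested_ve(const struct toy_ve_state *s)
{
	return s->info_consumed ? TOY_VE_DELIVERED : TOY_VE_FATAL;
}

/* Rule: NMIs are suppressed until the info has been read. */
static bool toy_nmi_allowed(const struct toy_ve_state *s)
{
	return s->info_consumed;
}
```

Normal interrupts are deliberately left out of the model, since the
question about them is still open.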

> +	ret = __tdcall(TDGETVEINFO, 0, 0, 0, 0, &out);
> +
> +	ve->exit_reason = out.rcx;
> +	ve->exit_qual   = out.rdx;
> +	ve->gla         = out.r8;
> +	ve->gpa         = out.r9;
> +	ve->instr_len   = out.r10 & UINT_MAX;
> +	ve->instr_info  = out.r10 >> 32;
> +
> +	return ret;
> +}
> +
> +int tdg_handle_virtualization_exception(struct pt_regs *regs,
> +		struct ve_info *ve)
> +{
> +	/*
> +	 * TODO: Add handler support for various #VE exit
> +	 * reasons. It will be added by other patches in
> +	 * the series.
> +	 */
> +	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> +	return -EFAULT;
> +}
> +
>  void __init tdx_early_init(void)
>  {
>  	if (!cpuid_has_tdx_guest())
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 213d4aa8e337..64869aa88a5a 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -61,6 +61,7 @@
>  #include <asm/insn.h>
>  #include <asm/insn-eval.h>
>  #include <asm/vdso.h>
> +#include <asm/tdx.h>
>  
>  #ifdef CONFIG_X86_64
>  #include <asm/x86_init.h>
> @@ -1140,6 +1141,35 @@ DEFINE_IDTENTRY(exc_device_not_available)
>  	}
>  }
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +DEFINE_IDTENTRY(exc_virtualization_exception)
> +{
> +	struct ve_info ve;
> +	int ret;
> +
> +	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> +
> +	/*
> +	 * Consume #VE info before re-enabling interrupts. It will be
> +	 * re-enabled after executing the TDGETVEINFO TDCALL.
> +	 */

"It" is nebulous here.  Is this talking about NMIs, or the
cond_local_irq_enable() that is "after" TDGETVEINFO?

> +	ret = tdg_get_ve_info(&ve);
> +
> +	cond_local_irq_enable(regs);
> +
> +	if (!ret)
> +		ret = tdg_handle_virtualization_exception(regs, &ve);
> +	/*
> +	 * If tdg_handle_virtualization_exception() could not process
> +	 * it successfully, treat it as #GP(0) and handle it.
> +	 */
> +	if (ret)
> +		do_general_protection(regs, 0);
> +
> +	cond_local_irq_disable(regs);
> +}
> +#endif
> +
>  #ifdef CONFIG_X86_32
>  DEFINE_IDTENTRY_SW(iret_error)
>  {
> 



* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-04-26 18:01 ` [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls Kuppuswamy Sathyanarayanan
@ 2021-05-07 21:46   ` Dave Hansen
  2021-05-08  0:59     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 21:46 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> KVM hypercalls have to be wrapped into vendor-specific TDVMCALLs.

How about:

KVM hypercalls use the "vmcall" or "vmmcall" instructions.  Although the
ABI is similar, those instructions no longer function for TDX guests.
Make TDVMCALLs instead of VMCALL/VMMCALL.

> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 338119852512..2fa85481520b 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -6,6 +6,7 @@
>  #include <asm/alternative.h>
>  #include <linux/interrupt.h>
>  #include <uapi/asm/kvm_para.h>
> +#include <asm/tdx.h>
>  
>  extern void kvmclock_init(void);
>  
> @@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
>  static inline long kvm_hypercall0(unsigned int nr)
>  {
>  	long ret;
> +
> +	if (is_tdx_guest())
> +		return tdx_kvm_hypercall0(nr);

... all of these look OK.

>  #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> index 81af70c2acbd..964bfd7fc682 100644
> --- a/arch/x86/kernel/tdcall.S
> +++ b/arch/x86/kernel/tdcall.S
> @@ -11,6 +11,7 @@
>   * refer to TDX GHCI specification).
>   */
>  #define TDVMCALL_EXPOSE_REGS_MASK	0xfc00
> +#define TDVMCALL_VENDOR_KVM		0x4d564b2e584454 /* "TDX.KVM" */
>  
>  /*
>   * TDX guests use the TDCALL instruction to make
> @@ -198,3 +199,9 @@ SYM_FUNC_START(__tdvmcall)
>  	call do_tdvmcall
>  	retq
>  SYM_FUNC_END(__tdvmcall)
> +
> +SYM_FUNC_START(__tdvmcall_vendor_kvm)
> +	movq $TDVMCALL_VENDOR_KVM, %r10
> +	call do_tdvmcall
> +	retq
> +SYM_FUNC_END(__tdvmcall_vendor_kvm)

Granted, this is not a ton of assembly.  But, it does look a bit weird.
It needs a comment and/or a mention in the changelog.

R10 is not part of the function call ABI, but it is a part of the
TDVMCALL ABI.  This little assembly wrapper lets us reuse do_tdvmcall()
for both KVM-specific hypercalls TDVMCALL_VENDOR_KVM and the more
generic __tdvmcalls.
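
To make that concrete, here is a rough userspace C model of the idea. It
is only a sketch: tdvmcall_common() and struct tdvmcall_regs are
hypothetical stand-ins for do_tdvmcall() and the register state, not the
real assembly. It also shows that the vendor constant is just the ASCII
bytes of "TDX.KVM" packed low byte first:

```c
#include <stddef.h>
#include <stdint.h>

#define TDVMCALL_STANDARD	0ULL
#define TDVMCALL_VENDOR_KVM	0x4d564b2e584454ULL /* "TDX.KVM" */

/* Hypothetical model of the state that matters here. */
struct tdvmcall_regs {
	uint64_t r10;	/* TDVMCALL type: 0 = standard, nonzero = vendor */
	uint64_t fn;	/* function number, passed through unchanged */
};

/* Stand-in for do_tdvmcall(): the shared path only records its inputs. */
static void tdvmcall_common(struct tdvmcall_regs *regs, uint64_t type,
			    uint64_t fn)
{
	regs->r10 = type;
	regs->fn = fn;
}

/* The two thin wrappers differ only in the value they place in "R10". */
static void tdvmcall_standard(struct tdvmcall_regs *regs, uint64_t fn)
{
	tdvmcall_common(regs, TDVMCALL_STANDARD, fn);
}

static void tdvmcall_vendor_kvm(struct tdvmcall_regs *regs, uint64_t fn)
{
	tdvmcall_common(regs, TDVMCALL_VENDOR_KVM, fn);
}

/* Unpack a vendor ID into a NUL-terminated string, low byte first. */
static void tdvmcall_vendor_name(uint64_t id, char out[9])
{
	size_t i;

	for (i = 0; i < 8; i++)
		out[i] = (char)((id >> (8 * i)) & 0xff);
	out[8] = '\0';
}
```

A comment along these lines above __tdvmcall_vendor_kvm would probably
be enough to explain why the wrapper exists.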

> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -8,6 +8,10 @@
>  
>  #include <linux/cpu.h>
>  
> +#ifdef CONFIG_KVM_GUEST
> +#include "tdx-kvm.c"
> +#endif
> +
>  static struct {
>  	unsigned int gpa_width;
>  	unsigned long attributes;

I know KVM does weird stuff.  But, this is *really* weird.  Why are we
#including a .c file into another .c file?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 15/32] x86/tdx: Handle in-kernel MMIO
  2021-04-26 18:01 ` [RFC v2 15/32] x86/tdx: Handle in-kernel MMIO Kuppuswamy Sathyanarayanan
@ 2021-05-07 21:52   ` Dave Hansen
  2021-05-18  0:48     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
  2021-06-02 19:42     ` [RFC v2-fix-v2 0/2] " Kuppuswamy Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 21:52 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
> exit reason.

This needs a bit of a history lesson.  "In traditional VMs, MMIO tends
to be implemented by giving a guest access to a mapping which will
cause a VMEXIT on access.  That's not possible in a TDX guest..."

> For now we only handle subset of instruction that kernel uses for MMIO
> oerations. User-space access triggers SIGBUS.

I still don't think that TDX guests should be doing things that they
*KNOW* will cause #VE, including MMIO.  I really want to hear a more
discrete story about why this is the *best* way to do this for Linux
instead of just a hack from the Windows binary driver ecosystem that
seemed expedient.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-04-26 18:01 ` [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code Kuppuswamy Sathyanarayanan
@ 2021-05-07 21:54   ` Dave Hansen
  2021-05-10 22:19     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 21:54 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

> +++ b/arch/x86/mm/mem_encrypt_common.c
...
> +/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
> +bool force_dma_unencrypted(struct device *dev)
> +{
> +	/*
> +	 * For SEV, all DMA must be to unencrypted/shared addresses.
> +	 */
> +	if (sev_active())
> +		return true;
> +
> +	/*
> +	 * For SME, all DMA must be to unencrypted addresses if the
> +	 * device does not support DMA to addresses that include the
> +	 * encryption mask.
> +	 */
> +	if (sme_active()) {
> +		u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
> +		u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
> +						dev->bus_dma_limit);
> +
> +		if (dma_dev_mask <= dma_enc_mask)
> +			return true;
> +	}
> +
> +	return false;
> +}

This doesn't seem much like common code to me.  It seems like 100% SEV
code.  Is this really where we want to move it?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-04-26 18:01 ` [RFC v2 28/32] x86/tdx: Make pages shared in ioremap() Kuppuswamy Sathyanarayanan
@ 2021-05-07 21:55   ` Dave Hansen
  2021-05-07 22:38     ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 21:55 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>  static unsigned int __ioremap_check_encrypted(struct resource *res)
>  {
> -	if (!sev_active())
> +	if (!sev_active() && !is_tdx_guest())
>  		return 0;

I think it's time to come up with a real name for all of the code that's
under: (sev_active() || is_tdx_guest()).

"encrypted" isn't it, for sure.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-07 21:55   ` Dave Hansen
@ 2021-05-07 22:38     ` Andi Kleen
  2021-05-10 22:23       ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-07 22:38 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel


On 5/7/2021 2:55 PM, Dave Hansen wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>>   static unsigned int __ioremap_check_encrypted(struct resource *res)
>>   {
>> -	if (!sev_active())
>> +	if (!sev_active() && !is_tdx_guest())
>>   		return 0;
> I think it's time to come up with a real name for all of the code that's
> under: (sev_active() || is_tdx_guest()).
>
> "encrypted" isn't it, for sure.

I called it protected_guest() in some other patches.
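
Roughly like this, as a userspace sketch with stubbed predicates (not
the actual patch; the stubs, MAP_FLAG_MODEL and ioremap_check_model()
are made up for illustration):

```c
#include <stdbool.h>

/* Stubs standing in for the kernel predicates; set by the caller. */
static bool sev_active_stub;
static bool tdx_guest_stub;

static bool sev_active(void)   { return sev_active_stub; }
static bool is_tdx_guest(void) { return tdx_guest_stub; }

/* One vendor-neutral name for "guest memory is protected from the host". */
static bool protected_guest(void)
{
	return sev_active() || is_tdx_guest();
}

#define MAP_FLAG_MODEL 1u	/* hypothetical placeholder for the real flags */

/* Model of the check in __ioremap_check_encrypted() using the new name. */
static unsigned int ioremap_check_model(void)
{
	if (!protected_guest())
		return 0;
	return MAP_FLAG_MODEL;
}
```

Call sites then test one predicate instead of growing another
"|| vendor_thing()" each time.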

-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 32/32] x86/tdx: ioapic: Add shared bit for IOAPIC base address
  2021-04-26 18:01 ` [RFC v2 32/32] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kuppuswamy Sathyanarayanan
@ 2021-05-07 23:06   ` Dave Hansen
  2021-05-24 23:29     ` [RFC v2-fix-v2 1/1] " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-07 23:06 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> IOAPIC is emulated by KVM which means its MMIO address is shared
> by host. Add shared bit for base address of IOAPIC.
> Most MMIO region is handled by ioremap which is already marked
> as shared for TDX guest platform, but IOAPIC is an exception which
> uses fixed map.

Ho hum...  I guess I'll rewrite the changelog:

The kernel interacts with each bare-metal IOAPIC with a special MMIO
page.  When running under KVM, the guest's IOAPICs are emulated by KVM.

When running as a TDX guest, the guest needs to mark each IOAPIC mapping
as "shared" with the host.  This ensures that TDX private protections
are not applied to the page, which allows the TDX host emulation to work.

Earlier patches in this series modified ioremap() so that
ioremap()-created mappings such as virtio will be marked as shared.
However, the IOAPIC code does not use ioremap() and instead uses the
fixmap mechanism.

Introduce a special fixmap helper just for the IOAPIC code.  Ensure that
it marks IOAPIC pages as "shared".  This replaces set_fixmap_nocache()
with __set_fixmap() since __set_fixmap() allows custom 'prot' values.

>  arch/x86/kernel/apic/io_apic.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index 73ff4dd426a8..2a01d4a82be7 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2675,6 +2675,14 @@ static struct resource * __init ioapic_setup_resources(void)
>  	return res;
>  }
>  
> +static void io_apic_set_fixmap_nocache(enum fixed_addresses idx, phys_addr_t phys)
> +{
> +	pgprot_t flags = FIXMAP_PAGE_NOCACHE;
> +	if (is_tdx_guest())
> +		flags = pgprot_tdg_shared(flags);
> +	__set_fixmap(idx, phys, flags);
> +}

^ This seems like it could at least use a one-liner comment.

>  void __init io_apic_init_mappings(void)
>  {
>  	unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
> @@ -2707,7 +2715,7 @@ void __init io_apic_init_mappings(void)
>  				      __func__, PAGE_SIZE, PAGE_SIZE);
>  			ioapic_phys = __pa(ioapic_phys);
>  		}
> -		set_fixmap_nocache(idx, ioapic_phys);
> +		io_apic_set_fixmap_nocache(idx, ioapic_phys);
>  		apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
>  			__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
>  			ioapic_phys);
> @@ -2836,7 +2844,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
>  	ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
>  	ioapics[idx].mp_config.apicaddr = address;
>  
> -	set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
> +	io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
>  	if (bad_ioapic_register(idx)) {
>  		clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
>  		return -ENODEV;
> 


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-05-07 21:46   ` Dave Hansen
@ 2021-05-08  0:59     ` Kuppuswamy, Sathyanarayanan
  2021-05-12 13:00       ` Kirill A. Shutemov
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-08  0:59 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata



On 5/7/21 2:46 PM, Dave Hansen wrote:
> I know KVM does weird stuff.  But, this is *really* weird.  Why are we
> #including a .c file into another .c file?

I think Kirill implemented it this way to skip Makefile changes for it. I don't
see any other KVM direct dependencies in tdx.c.

I will fix it in the next version.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-04-27 17:31   ` Borislav Petkov
  2021-05-06 14:59     ` Kirill A. Shutemov
@ 2021-05-10  8:07     ` Juergen Gross
  2021-05-10 15:52       ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Juergen Gross @ 2021-05-10  8:07 UTC (permalink / raw)
  To: Borislav Petkov, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel


[-- Attachment #1.1.1: Type: text/plain, Size: 1016 bytes --]

On 27.04.21 19:31, Borislav Petkov wrote:
> + Jürgen.
> 
> On Mon, Apr 26, 2021 at 11:01:28AM -0700, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> Split off halt paravirt calls from CONFIG_PARAVIRT_XXL into
>> a separate config option. It provides a middle ground for
>> not-so-deep paravirtulized environments.
> 
> Please introduce a spellchecker into your patch creation workflow.
> 
> Also, what does "not-so-deep" mean?
> 
>> CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
>> calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
>> config would be a bloat for TDX.
> 
> Used how? Why is it bloat for TDX?

Is there any major downside to moving the halt related pvops functions
from CONFIG_PARAVIRT_XXL to CONFIG_PARAVIRT?

I'd rather introduce a new PARAVIRT level only in case of multiple
pvops functions needed for a new guest type, or if a real hot path
would be affected.


Juergen

[-- Attachment #1.1.2: OpenPGP_0xB0DE9DD628BF132F.asc --]
[-- Type: application/pgp-keys, Size: 3135 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-05-10  8:07     ` Juergen Gross
@ 2021-05-10 15:52       ` Andi Kleen
  2021-05-10 15:56         ` Juergen Gross
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-10 15:52 UTC (permalink / raw)
  To: Juergen Gross, Borislav Petkov, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

>>> CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
>>> calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
>>> config would be a bloat for TDX.
>>
>> Used how? Why is it bloat for TDX?
>
> Is there any major downside to move the halt related pvops functions
> from CONFIG_PARAVIRT_XXL to CONFIG_PARAVIRT?

I think the main motivation is to get rid of all the page table related
hooks for modern configurations. These are the bulk of the annotations
and cause bloat and worse code. Shadow page tables are really obscure
these days, very few people still need them, and it's totally
reasonable to build even widely used distribution kernels without them.
In contrast, the other hooks are comparatively few and sit on
comparatively slow paths, so they don't really matter too much.

I think it would be ok to have a CONFIG_PARAVIRT that does not have page 
table support, and a separate config option for those (that could be 
eventually deprecated).

But that would break existing .configs for those shadow page table users,
which is why I think Kirill did it the other way around.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-05-10 15:52       ` Andi Kleen
@ 2021-05-10 15:56         ` Juergen Gross
  2021-05-12 12:07           ` Kirill A. Shutemov
  2021-05-12 13:18           ` Peter Zijlstra
  0 siblings, 2 replies; 381+ messages in thread
From: Juergen Gross @ 2021-05-10 15:56 UTC (permalink / raw)
  To: Andi Kleen, Borislav Petkov, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel


[-- Attachment #1.1.1: Type: text/plain, Size: 1500 bytes --]

On 10.05.21 17:52, Andi Kleen wrote:
>>>> CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
>>>> calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
>>>> config would be a bloat for TDX.
>>>
>>> Used how? Why is it bloat for TDX?
>>
>> Is there any major downside to move the halt related pvops functions
>> from CONFIG_PARAVIRT_XXL to CONFIG_PARAVIRT?
> 
> I think the main motivation is to get rid of all the page table related 
> hooks for modern configurations. These are the bulk of the annotations 
> and  cause bloat and worse code. Shadow page tables are really obscure 
> these days and very few people still need them and it's totally 
> reasonable to build even widely used distribution kernels without them. 
> On contrast most of the other hooks are comparatively few and also on 
> comparatively slow paths, so don't really matter too much.
> 
> I think it would be ok to have a CONFIG_PARAVIRT that does not have page 
> table support, and a separate config option for those (that could be 
> eventually deprecated).
> 
> But that would break existing .configs for those shadow stack users, 
> that's why I think Kirill did it the other way around.

No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
other hypervisors' guests, supporting basically the TLB flush operations
and time related operations only. Adding the halt related operations to
PARAVIRT wouldn't break anything.


Juergen


[-- Attachment #1.1.2: OpenPGP_0xB0DE9DD628BF132F.asc --]
[-- Type: application/pgp-keys, Size: 3135 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-04-26 18:01 ` [RFC v2 14/32] x86/tdx: Handle port I/O Kuppuswamy Sathyanarayanan
@ 2021-05-10 21:57   ` Dan Williams
  2021-05-10 23:08     ` Andi Kleen
  2021-05-11 15:35     ` Dave Hansen
  0 siblings, 2 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-10 21:57 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Mon, Apr 26, 2021 at 11:02 AM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

While I do not expect that a patch in the middle of a series needs the
full introduction of all concepts, the high expectations of the
reader's context in this changelog make the patch actively painful to
read.

Some connective tissue commentary to list assumptions and pointers to
definitions is needed to make this patch stand alone when a future
bisect lands on it and someone wonders where to get started debugging
it.

> Unroll string operations and handle port I/O through TDVMCALLs.
> Also handle #VE due to I/O operations with the same TDVMCALLs.

There is a mix of direct-TDVMCALL usage and #VE handling. When and why
is each approach used?

> Decompression code uses port IO for earlyprintk. We must use
> paravirt calls there too if we want to allow earlyprintk.

What is the tradeoff between teaching the decompression code to handle
#VE (the implied assumption) vs teaching it to avoid #VE with direct
TDVMCALLs (the chosen direction)?

Rewrite without "we":

"Given the need to support earlyprintk for protected guests, deploy
paravirt calls for the in*() and out*() usage in the decompression code."

This raises the question of why the cover letter switched from
explicitly saying TDVMCALL to "paravirt" where it could be confused
with the typical paravirt helpers?

>
> Decompresion code cannot deal with alternatives: use branches

s/Decompresion/Decompression/

> instead to implement inX() and outX() helpers.
>
> Since we use call instruction in place of in/out instruction,
> the argument passed to call instruction has to be in a
> register, it cannot be an immediate value like in/out
> instruction. So change constraint flag from "Nd" to "d"

Rewrite without "we":

With the approach to use a "call" instruction as an alternative for an
"in/out" instruction, it is no longer the case that the argument can be
an immediate value. Change the asm constraint flag from "Nd" to "d" to
accommodate.

>
> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/boot/compressed/Makefile |   1 +
>  arch/x86/boot/compressed/tdcall.S |   9 ++
>  arch/x86/include/asm/io.h         |   5 +-
>  arch/x86/include/asm/tdx.h        |  46 ++++++++-
>  arch/x86/kernel/tdcall.S          | 154 ++++++++++++++++++++++++++++++

Why is this named "tdcall" when it is implementing tdvmcalls? I must
say those names don't really help me understand what they do. Can we
have Linux names that don't mandate keeping the spec terminology in my
brain's translation cache?

>  arch/x86/kernel/tdx.c             |  33 +++++++
>  6 files changed, 245 insertions(+), 3 deletions(-)
>  create mode 100644 arch/x86/boot/compressed/tdcall.S
>
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index a2554621cefe..a944a2038797 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -97,6 +97,7 @@ endif
>
>  vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
>  vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
> +vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o
>
>  vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
>  efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
> diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
> new file mode 100644
> index 000000000000..5ebb80d45ad8
> --- /dev/null
> +++ b/arch/x86/boot/compressed/tdcall.S
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include <asm/export.h>
> +
> +/* Do not export symbols in decompression code */
> +#undef EXPORT_SYMBOL
> +#define EXPORT_SYMBOL(sym)

What's wrong with the existing:

KBUILD_CFLAGS += -D__DISABLE_EXPORTS

...in arch/x86/boot/compressed/Makefile?

> +
> +#include "../../kernel/tdcall.S"
> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> index ef7a686a55a9..30a3b30395ad 100644
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -43,6 +43,7 @@
>  #include <asm/page.h>
>  #include <asm/early_ioremap.h>
>  #include <asm/pgtable_types.h>
> +#include <asm/tdx.h>
>
>  #define build_mmio_read(name, size, type, reg, barrier) \
>  static inline type name(const volatile void __iomem *addr) \
> @@ -309,7 +310,7 @@ static inline unsigned type in##bwl##_p(int port)                   \
>                                                                         \
>  static inline void outs##bwl(int port, const void *addr, unsigned long count) \
>  {                                                                      \
> -       if (sev_key_active()) {                                         \
> +       if (sev_key_active() || is_tdx_guest()) {                       \

Is there a unified Linux name these can be given to stop the
proliferation of poor vendor names for similar concepts?

That routine

>                 unsigned type *value = (unsigned type *)addr;           \
>                 while (count) {                                         \
>                         out##bwl(*value, port);                         \
> @@ -325,7 +326,7 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
>                                                                         \
>  static inline void ins##bwl(int port, void *addr, unsigned long count) \
>  {                                                                      \
> -       if (sev_key_active()) {                                         \
> +       if (sev_key_active() || is_tdx_guest()) {                       \
>                 unsigned type *value = (unsigned type *)addr;           \
>                 while (count) {                                         \
>                         *value = in##bwl(port);                         \
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index e0b3ed9e262c..b972c6531a53 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -5,6 +5,8 @@
>
>  #define TDX_CPUID_LEAF_ID      0x21
>
> +#ifndef __ASSEMBLY__
> +
>  #ifdef CONFIG_INTEL_TDX_GUEST
>
>  #include <asm/cpufeature.h>
> @@ -67,6 +69,48 @@ long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
>  long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
>                 unsigned long p3, unsigned long p4);
>
> +/* Decompression code doesn't know how to handle alternatives */

Does it also not know how to handle #VE to keep it aligned with the
runtime code?

> +#ifdef BOOT_COMPRESSED_MISC_H
> +#define __out(bwl, bw)                                                 \
> +do {                                                                   \
> +       if (is_tdx_guest()) {                                           \
> +               asm volatile("call tdg_out" #bwl : :                    \
> +                               "a"(value), "d"(port));                 \
> +       } else {                                                        \
> +               asm volatile("out" #bwl " %" #bw "0, %w1" : :           \
> +                               "a"(value), "Nd"(port));                \
> +       }                                                               \
> +} while (0)
> +#define __in(bwl, bw)                                                  \
> +do {                                                                   \
> +       if (is_tdx_guest()) {                                           \
> +               asm volatile("call tdg_in" #bwl :                       \
> +                               "=a"(value) : "d"(port));               \
> +       } else {                                                        \
> +               asm volatile("in" #bwl " %w1, %" #bw "0" :              \
> +                               "=a"(value) : "Nd"(port));              \
> +       }                                                               \
> +} while (0)
> +#else
> +#define __out(bwl, bw)                                                 \
> +       alternative_input("out" #bwl " %" #bw "1, %w2",                 \
> +                       "call tdg_out" #bwl, X86_FEATURE_TDX_GUEST,     \
> +                       "a"(value), "d"(port))
> +
> +#define __in(bwl, bw)                                                  \
> +       alternative_io("in" #bwl " %w2, %" #bw "0",                     \
> +                       "call tdg_in" #bwl, X86_FEATURE_TDX_GUEST,      \
> +                       "=a"(value), "d"(port))

Outside the boot decompression code, isn't this branch of the "ifdef
BOOT_COMPRESSED_MISC_H" handled by #VE? I also don't see any usage of
__{in,out}() in this patch.

> +#endif
> +
> +void tdg_outb(unsigned char value, unsigned short port);
> +void tdg_outw(unsigned short value, unsigned short port);
> +void tdg_outl(unsigned int value, unsigned short port);
> +
> +unsigned char tdg_inb(unsigned short port);
> +unsigned short tdg_inw(unsigned short port);
> +unsigned int tdg_inl(unsigned short port);
> +
>  #else // !CONFIG_INTEL_TDX_GUEST
>
>  static inline bool is_tdx_guest(void)
> @@ -106,5 +150,5 @@ static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
>  }
>
>  #endif /* CONFIG_INTEL_TDX_GUEST */
> -
> +#endif /* __ASSEMBLY__ */
>  #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> index 964bfd7fc682..df4159bb5103 100644
> --- a/arch/x86/kernel/tdcall.S
> +++ b/arch/x86/kernel/tdcall.S
> @@ -3,6 +3,7 @@
>  #include <asm/asm.h>
>  #include <asm/frame.h>
>  #include <asm/unwind_hints.h>
> +#include <asm/export.h>
>
>  #include <linux/linkage.h>
>
> @@ -12,6 +13,12 @@
>   */
>  #define TDVMCALL_EXPOSE_REGS_MASK      0xfc00
>  #define TDVMCALL_VENDOR_KVM            0x4d564b2e584454 /* "TDX.KVM" */
> +#define EXIT_REASON_IO_INSTRUCTION     30
> +/*
> + * Current size of struct tdvmcall_output is 40 bytes,
> + * but allocate double to account future changes.

What future changes? Why could they not be handled as future code changes?

> + */
> +#define TDVMCALL_OUTPUT_SIZE           80

Perhaps "PAYLOAD_SIZE" since it is used for both input and output?

If the ABI does not include the size of the payload then how would
code detect if even the 80-byte limit were exceeded in the future?
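
One way to let the build catch it, as a sketch: tie the assembly-side
constant to the C struct with a compile-time assertion. The struct
layout below is an assumption (five 64-bit registers, matching the 40
bytes the comment mentions), and all the *_model / *_MODEL names are
hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: five 64-bit output registers, 40 bytes in total. */
struct tdvmcall_output_model {
	uint64_t r11;
	uint64_t r12;
	uint64_t r13;
	uint64_t r14;
	uint64_t r15;
};

/* Stack space reserved by the assembly; mirrors the quoted value of 80. */
#define TDVMCALL_OUTPUT_SIZE_MODEL	80

/* Fails the build if the struct ever outgrows the reserved space. */
_Static_assert(sizeof(struct tdvmcall_output_model) <=
	       TDVMCALL_OUTPUT_SIZE_MODEL,
	       "tdvmcall output no longer fits in the reserved stack space");

/* Offset that an asm-side symbol such as TDVMCALL_r11 would presumably name. */
#define TDVMCALL_R11_OFFSET_MODEL offsetof(struct tdvmcall_output_model, r11)
```

In the kernel proper the same check would more likely be a
BUILD_BUG_ON() next to the asm-offsets definitions, at which point the
padding to 80 bytes seems unnecessary.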

>
>  /*
>   * TDX guests use the TDCALL instruction to make
> @@ -205,3 +212,150 @@ SYM_FUNC_START(__tdvmcall_vendor_kvm)
>         call do_tdvmcall
>         retq
>  SYM_FUNC_END(__tdvmcall_vendor_kvm)
> +
> +.macro io_save_registers
> +       push %rbp
> +       push %rbx
> +       push %rcx
> +       push %rdx
> +       push %rdi
> +       push %rsi
> +       push %r8
> +       push %r9
> +       push %r10
> +       push %r11
> +       push %r12
> +       push %r13
> +       push %r14
> +       push %r15

Surely there's an existing macro for this pattern? Would
PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
would eliminate clearing of %r8.

> +.endm
> +.macro io_restore_registers
> +       pop %r15
> +       pop %r14
> +       pop %r13
> +       pop %r12
> +       pop %r11
> +       pop %r10
> +       pop %r9
> +       pop %r8
> +       pop %rsi
> +       pop %rdi
> +       pop %rdx
> +       pop %rcx
> +       pop %rbx
> +       pop %rbp
> +.endm
> +
> +/*
> + * tdg_out{b,w,l}()  - Write given data to the specified port.
> + *
> + * @arg1 (RAX)       - Value to be written (passed via R8 to do_tdvmcall()).
> + * @arg2 (RDX)       - Port id (passed via RCX to do_tdvmcall()).
> + *
> + */
> +SYM_FUNC_START(tdg_outb)
> +       io_save_registers
> +       xor %r8, %r8
> +       /* Move data to R8 register */
> +       mov %al, %r8b
> +       /* Set data width to 1 byte */
> +       mov $1, %rsi
> +       jmp 1f
> +
> +SYM_FUNC_START(tdg_outw)
> +       io_save_registers
> +       xor %r8, %r8
> +       /* Move data to R8 register */
> +       mov %ax, %r8w
> +       /* Set data width to 2 bytes */
> +       mov $2, %rsi
> +       jmp 1f
> +
> +SYM_FUNC_START(tdg_outl)
> +       io_save_registers
> +       xor %r8, %r8
> +       /* Move data to R8 register */
> +       mov %eax, %r8d
> +       /* Set data width to 4 bytes */
> +       mov $4, %rsi
> +1:
> +       /*
> +        * Since io_save_registers does not save rax
> +        * state, save it here so that we can preserve
> +        * the caller register state.
> +        */
> +       push %rax
> +
> +       mov %rdx, %rcx
> +       /* Set 1 in RDX to select out operation */
> +       mov $1, %rdx
> +       /* Set TDVMCALL function id in RDI */
> +       mov $EXIT_REASON_IO_INSTRUCTION, %rdi
> +       /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
> +       xor %r10, %r10
> +       /* Since we don't use tdvmcall output, set it to NULL */
> +       xor %r9, %r9
> +
> +       call do_tdvmcall
> +
> +       pop %rax
> +       io_restore_registers
> +       ret
> +SYM_FUNC_END(tdg_outb)
> +SYM_FUNC_END(tdg_outw)
> +SYM_FUNC_END(tdg_outl)
> +EXPORT_SYMBOL(tdg_outb)
> +EXPORT_SYMBOL(tdg_outw)
> +EXPORT_SYMBOL(tdg_outl)
> +
> +/*
> + * tdg_in{b,w,l}()   - Read data to the specified port.
> + *
> + * @arg1 (RDX)       - Port id (passed via RCX to do_tdvmcall()).
> + *
> + * Returns data read via RAX register.
> + *
> + */
> +SYM_FUNC_START(tdg_inb)
> +       io_save_registers
> +       /* Set data width to 1 byte */
> +       mov $1, %rsi
> +       jmp 1f
> +
> +SYM_FUNC_START(tdg_inw)
> +       io_save_registers
> +       /* Set data width to 2 bytes */
> +       mov $2, %rsi
> +       jmp 1f
> +
> +SYM_FUNC_START(tdg_inl)
> +       io_save_registers
> +       /* Set data width to 4 bytes */
> +       mov $4, %rsi
> +1:
> +       mov %rdx, %rcx
> +       /* Set 0 in RDX to select in operation */
> +       mov $0, %rdx
> +       /* Set TDVMCALL function id in RDI */
> +       mov $EXIT_REASON_IO_INSTRUCTION, %rdi
> +       /* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
> +       xor %r10, %r10
> +       /* Allocate memory in stack for Output */
> +       subq $TDVMCALL_OUTPUT_SIZE, %rsp

Why is this the leaf function's responsibility? I would expect the core
do_tdvmcall (or whatever it is renamed to) helper to hide output
buffer payload handling. tdg_in* only wants 1, 2, or 4 bytes, not 40
bytes of payload to handle.

> +       /* Move tdvmcall_output pointer to R9 */
> +       movq %rsp, %r9
> +
> +       call do_tdvmcall
> +
> +       /* Move data read from port to RAX */
> +       mov TDVMCALL_r11(%r9), %eax

"TDVMCALL_r11" is unreadable, what is that doing?

Shouldn't failed in* calls signal failure with an all ones result?
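
For illustration, that convention could look like the sketch below. It is
plain user-space C, not kernel code: fake_port_read() is a hypothetical
stand-in for the hypercall, and port_in() and io_mask() are made-up names.
The only point is that a failed read is reported as all ones in the
requested width.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the hypercall: returns 0 on success and
 * stores the value read in *val, non-zero on failure.
 */
static inline int fake_port_read(int size, int port, uint32_t *val)
{
	(void)size;
	if (port == 0x3f8) {	/* pretend only this port is backed */
		*val = 0x41;
		return 0;
	}
	return -1;
}

/* Mask covering the low 'size' bytes; size must be 1, 2 or 4. */
static inline uint32_t io_mask(int size)
{
	assert(size == 1 || size == 2 || size == 4);
	return (size == 4) ? 0xffffffffu : ((1u << (8 * size)) - 1);
}

/* Read from a port; a failed read is reported as all ones. */
static inline uint32_t port_in(int size, int port)
{
	uint32_t val = 0;

	if (fake_port_read(size, port, &val))
		return io_mask(size);
	return val & io_mask(size);
}
```

With this stub, port_in(1, 0x3f8) yields 0x41, while a 2-byte read from
an unbacked port yields 0xffff.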

> +       /* Free allocated memory */
> +       addq $TDVMCALL_OUTPUT_SIZE, %rsp
> +       io_restore_registers
> +       ret
> +SYM_FUNC_END(tdg_inb)
> +SYM_FUNC_END(tdg_inw)
> +SYM_FUNC_END(tdg_inl)
> +EXPORT_SYMBOL(tdg_inb)
> +EXPORT_SYMBOL(tdg_inw)
> +EXPORT_SYMBOL(tdg_inl)
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index e42e260df245..ec61f2f06c98 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -189,6 +189,36 @@ static void tdg_handle_cpuid(struct pt_regs *regs)
>         regs->dx = out.r15;
>  }
>
> +static void tdg_out(int size, int port, unsigned int value)
> +{
> +       tdvmcall(EXIT_REASON_IO_INSTRUCTION, size, 1, port, value);
> +}
> +
> +static unsigned int tdg_in(int size, int port)
> +{
> +       return tdvmcall_out_r11(EXIT_REASON_IO_INSTRUCTION, size, 0, port, 0);
> +}
> +
> +static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
> +{
> +       bool string = exit_qual & 16;
> +       int out, size, port;
> +
> +       /* I/O strings ops are unrolled at build time. */
> +       BUG_ON(string);
> +
> +       out = (exit_qual & 8) ? 0 : 1;
> +       size = (exit_qual & 7) + 1;
> +       port = exit_qual >> 16;

This seems to be begging for exit_qual helpers to put symbolic names
on these operations.
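
Something along these lines, as a rough sketch only. The names are
hypothetical and not from the patch; the bit positions are simply the
ones open-coded above (bits 2:0 size minus one, bit 3 direction, bit 4
string, bits 31:16 port). It is written as standalone C so it can be
compiled outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the fields tested in tdg_handle_io(). */
#define VE_IO_SIZE_MASK		0x7u		/* bits 2:0, access size minus one */
#define VE_IO_IN		(1u << 3)	/* bit 3, set for IN, clear for OUT */
#define VE_IO_STRING		(1u << 4)	/* bit 4, set for string instructions */
#define VE_IO_PORT_SHIFT	16		/* bits 31:16, port number */

/* Access size in bytes. */
static inline int ve_io_size(uint32_t exit_qual)
{
	return (int)(exit_qual & VE_IO_SIZE_MASK) + 1;
}

/* True for IN, false for OUT. */
static inline bool ve_io_is_in(uint32_t exit_qual)
{
	return (exit_qual & VE_IO_IN) != 0;
}

/* True for string variants (INS/OUTS). */
static inline bool ve_io_is_string(uint32_t exit_qual)
{
	return (exit_qual & VE_IO_STRING) != 0;
}

/* Port number. */
static inline int ve_io_port(uint32_t exit_qual)
{
	return (int)(exit_qual >> VE_IO_PORT_SHIFT);
}
```

The handler would then read as ve_io_is_in(), ve_io_size() and
ve_io_port() calls instead of magic numbers. When building the width
mask from the decoded size, keep in mind that the kernel's GENMASK(h, l)
includes bit h, so an n-bit mask is GENMASK(n - 1, 0).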

> +
> +       if (out) {
> +               tdg_out(size, port, regs->ax);
> +       } else {
> +               regs->ax &= ~GENMASK(8 * size, 0);
> +               regs->ax |= tdg_in(size, port) & GENMASK(8 * size, 0);
> +       }
> +}
> +
>  unsigned long tdg_get_ve_info(struct ve_info *ve)
>  {
>         u64 ret;
> @@ -238,6 +268,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>         case EXIT_REASON_CPUID:
>                 tdg_handle_cpuid(regs);
>                 break;
> +       case EXIT_REASON_IO_INSTRUCTION:
> +               tdg_handle_io(regs, ve->exit_qual);
> +               break;
>         default:
>                 pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>                 return -EFAULT;
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-07 21:54   ` Dave Hansen
@ 2021-05-10 22:19     ` Kuppuswamy, Sathyanarayanan
  2021-05-10 22:23       ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-10 22:19 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 5/7/21 2:54 PM, Dave Hansen wrote:
> This doesn't seem much like common code to me.  It seems like 100% SEV
> code.  Is this really where we want to move it?

Both the SEV and TDX code need to enable
CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED and define the force_dma_unencrypted()
function.

force_dma_unencrypted() is modified by the patch titled "x86/tdx: Make DMA
pages shared" to add TDX guest-specific support.

Since both the SEV and TDX code use it, it's moved to a common file.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-07 22:38     ` Andi Kleen
@ 2021-05-10 22:23       ` Kuppuswamy, Sathyanarayanan
  2021-05-10 22:30         ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-10 22:23 UTC (permalink / raw)
  To: Andi Kleen, Dave Hansen, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel

Hi Dave,

On 5/7/21 3:38 PM, Andi Kleen wrote:
> 
> On 5/7/2021 2:55 PM, Dave Hansen wrote:
>> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
>>>   static unsigned int __ioremap_check_encrypted(struct resource *res)
>>>   {
>>> -    if (!sev_active())
>>> +    if (!sev_active() && !is_tdx_guest())
>>>           return 0;
>> I think it's time to come up with a real name for all of the code that's
>> under: (sev_active() || is_tdx_guest()).
>>
>> "encrypted" isn't it, for sure.
> 
> I called it protected_guest() in some other patches.

If you are also fine with the above-mentioned function name, I can include it
in this series. Since we have many uses of the above condition, it would
be useful to define it as a helper function.

> 
> -Andi
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-10 22:19     ` Kuppuswamy, Sathyanarayanan
@ 2021-05-10 22:23       ` Dave Hansen
  2021-05-12 13:08         ` Kirill A. Shutemov
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-10 22:23 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 5/10/21 3:19 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 5/7/21 2:54 PM, Dave Hansen wrote:
>> This doesn't seem much like common code to me.  It seems like 100% SEV
>> code.  Is this really where we want to move it?
> 
> Both SEV and TDX code has requirement to enable
> CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED and define force_dma_unencrypted()
> function.
> 
> force_dma_unencrypted() is modified by patch titled "x86/tdx: Make DMA
> pages shared" to add TDX guest specific support.
> 
> Since both SEV and TDX code uses it, its moved to common file.

That's not an excuse to have a bunch of AMD (or Intel) feature-specific
code in a file named "common".  I'd make an attempt to keep them
separate and then call into the two separate functions *from* the common
function.
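
Roughly this shape, as a sketch with stubbed-out bodies. The two
per-vendor function names and the sev_guest/tdx_guest flags are
hypothetical placeholders; the real AMD check has more conditions than
a single flag.

```c
#include <assert.h>
#include <stdbool.h>

struct device;	/* opaque here; only passed through */

/* Hypothetical stand-ins for the real guest-type checks. */
static bool sev_guest;
static bool tdx_guest;

/* Would live with the AMD memory-encryption code. */
static inline bool amd_force_dma_unencrypted(struct device *dev)
{
	(void)dev;
	return sev_guest;
}

/* Would live with the TDX guest code. */
static inline bool tdx_force_dma_unencrypted(struct device *dev)
{
	(void)dev;
	return tdx_guest;
}

/* The only thing left in the "common" file: pure dispatch. */
static inline bool force_dma_unencrypted(struct device *dev)
{
	return amd_force_dma_unencrypted(dev) ||
	       tdx_force_dma_unencrypted(dev);
}
```

That keeps the vendor-specific policy next to the vendor code, and the
common file only decides whom to ask.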

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-10 22:23       ` Kuppuswamy, Sathyanarayanan
@ 2021-05-10 22:30         ` Dave Hansen
  2021-05-10 22:52           ` Sean Christopherson
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-10 22:30 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Andi Kleen, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel

On 5/10/21 3:23 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>>
>>>> -    if (!sev_active())
>>>> +    if (!sev_active() && !is_tdx_guest())
>>>>           return 0;
>>> I think it's time to come up with a real name for all of the code that's
>>> under: (sev_active() || is_tdx_guest()).
>>>
>>> "encrypted" isn't it, for sure.
>>
>> I called it protected_guest() in some other patches.
> 
> If you are also fine with above mentioned function name, I can include it
> in this series. Since we have many use cases of above condition, it will
> be useful define it as helper function.

FWIW, I think sev_active() has a horrible name.  Shouldn't that be
"is_sev_guest()"?  "sev_active()" could be read as "I'm a SEV host" or
"I'm a SEV guest" and "SEV is active".

protected_guest() seems fine to cover both, despite the horrid SEV
naming.  It'll actually be nice to banish it from appearing in many of
its uses. :)


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-10 22:30         ` Dave Hansen
@ 2021-05-10 22:52           ` Sean Christopherson
  2021-05-11  9:35             ` Borislav Petkov
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-10 22:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy, Sathyanarayanan, Andi Kleen, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel,
	Borislav Petkov

+Boris, who has similar opinions on sev_active().

On Mon, May 10, 2021, Dave Hansen wrote:
> On 5/10/21 3:23 PM, Kuppuswamy, Sathyanarayanan wrote:
> >>>>
> >>>> -    if (!sev_active())
> >>>> +    if (!sev_active() && !is_tdx_guest())
> >>>>           return 0;
> >>> I think it's time to come up with a real name for all of the code that's
> >>> under: (sev_active() || is_tdx_guest()).
> >>>
> >>> "encrypted" isn't it, for sure.
> >>
> >> I called it protected_guest() in some other patches.
> > 
> > If you are also fine with above mentioned function name, I can include it
> > in this series. Since we have many use cases of above condition, it will
> > be useful define it as helper function.
> 
> FWIW, I think sev_active() has a horrible name.  Shouldn't that be
> "is_sev_guest()"?  "sev_active()" could be read as "I'm a SEV host" or
> "I'm a SEV guest" and "SEV is active".

I can't find the thread offhand, but Boris proposed something along the lines of
cpu_has(), but specific to a given flavor of protected guest.  IIRC, it was
sev_guest_has(SEV_ES) or something like that.

I 100% agree that we should have actual feature bits somewhere for the various
protected guest flavors.
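
Purely as an illustration of the idea, and with made-up names (this is
not the interface Boris proposed, which I can't quote from memory), a
flag-based query could be as small as:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag names, one bit per protected-guest property. */
enum pg_flag {
	PG_FLAG_SEV	= 1u << 0,
	PG_FLAG_SEV_ES	= 1u << 1,
	PG_FLAG_TDX	= 1u << 2,
};

/* A real implementation would fill this in once during early boot. */
static unsigned int pg_flags;

/* True if every bit in 'flag' is set for this guest. */
static inline bool protected_guest_has(unsigned int flag)
{
	return (pg_flags & flag) == flag;
}

/* True for any flavor of protected guest. */
static inline bool protected_guest(void)
{
	return pg_flags != 0;
}
```

Callers that only care about "some kind of protected guest" use
protected_guest(), and flavor-specific code asks for the exact bit.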

> protected_guest() seems fine to cover both, despite the horrid SEV
> naming.  It'll actually be nice to banish it from appearing in many of
> its uses. :)

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-10 21:57   ` Dan Williams
@ 2021-05-10 23:08     ` Andi Kleen
  2021-05-10 23:34       ` Dan Williams
  2021-05-11 15:35     ` Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-10 23:08 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List


On 5/10/2021 2:57 PM, Dan Williams wrote:
>
> There is a mix of direct-TDVMCALL usage and handling #VE when and why
> is either approached used?

For the really early code in the decompressor or the main kernel we 
can't use #VE because the IDT needed for handling the exception is not 
set up, and some other infrastructure needed by the handler is missing. 
The early code needs to do port IO to be able to write the early serial 
console. To keep it all common it ended up that all port IO is paravirt. 
Actually for most of the main kernel port IO calls we could just use #VE 
and it would result in smaller binaries, but then we would need to 
annotate all early portio with some special name. That's why port IO is 
all TDCALL.

For some others the only thing that really has to be #VE is MMIO because 
we don't want to annotate every MMIO read*/write* with an alternative 
(which would result in incredible binary bloat). For the others they have 
mostly become direct calls now.


>
>> Decompression code uses port IO for earlyprintk. We must use
>> paravirt calls there too if we want to allow earlyprintk.
> What is the tradeoff between teaching the decompression code to handle
> #VE (the implied assumption) vs teaching it to avoid #VE with direct
> TDVMCALLs (the chosen direction)?

The decompression code only really needs it to output something. But you 
couldn't debug anything until #VE is set up. Also the decompression code 
has a very basic environment that doesn't supply most kernel services, 
and the #VE handler is relatively complicated. It would probably need to 
be duplicated and the instruction decoder be ported to work in this 
environment. It would be all a lot of work, just to make the debug 
output work.

>
>> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>> ---
>>   arch/x86/boot/compressed/Makefile |   1 +
>>   arch/x86/boot/compressed/tdcall.S |   9 ++
>>   arch/x86/include/asm/io.h         |   5 +-
>>   arch/x86/include/asm/tdx.h        |  46 ++++++++-
>>   arch/x86/kernel/tdcall.S          | 154 ++++++++++++++++++++++++++++++
> Why is this named "tdcall" when it is implementing tdvmcalls? I must
> say those names don't really help me understand what they do. Can we
> have Linux names that don't mandate keeping the spec terminology in my
> brain's translation cache?

The instruction is called TDCALL. It's always the same instruction.

TDVMCALL is the variant when the host processes it (as opposed to the 
TDX module), but it's just a different name space in the call number.


> Is there a unified Linux name these can be given to stop the
> proliferation of poor vendor names for similar concepts?

We could use protected_guest()


>
> Does it also not know how to handle #VE to keep it aligned with the
> runtime code?


Not sure I understand the question, but the decompression code supports 
neither alternatives nor #VE. It's a very limited environment.

>
> Outside the boot decompression code isn't this branch of the "ifdef
> BOOT_COMPRESSED_MISC_H"  handled by #VE? I also don't see any usage of
> __{in,out}() in this patch.

I thought it was all alternative after decompression, so the #VE code 
shouldn't be called. We still have it for some reason though.


>
> Perhaps "PAYLOAD_SIZE" since it is used for both input and output?
>
> If the ABI does not include the size of the payload then how would
> code detect if even 80 bytes was violated in the future?


The payload in memory is just a Linux concept. At the TDCALL level it's 
only registers.


>
> Surely there's an existing macro for this pattern? Would
> PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
> would eliminate clearing of %r8.


There used to be SAVE_ALL/SAVE_REGS, but they have all been removed in 
some past refactorings.


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-10 23:08     ` Andi Kleen
@ 2021-05-10 23:34       ` Dan Williams
  2021-05-11  0:01         ` Andi Kleen
                           ` (2 more replies)
  0 siblings, 3 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-10 23:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Mon, May 10, 2021 at 4:08 PM Andi Kleen <ak@linux.intel.com> wrote:
>
>
> On 5/10/2021 2:57 PM, Dan Williams wrote:
> >
> > There is a mix of direct-TDVMCALL usage and handling #VE when and why
> > is either approached used?
>
> For the really early code in the decompressor or the main kernel we
> can't use #VE because the IDT needed for handling the exception is not
> set up, and some other infrastructure needed by the handler is missing.
> The early code needs to do port IO to be able to write the early serial
> console. To keep it all common it ended up that all port IO is paravirt.
> Actually for most the main kernel port IO calls we could just use #VE
> and it would result in smaller binaries, but then we would need to
> annotate all early portio with some special name. That's why port IO is
> all TDCALL.

Thanks Andi. Sathya, please include the above in the next posting.

>
> For some others the only thing that really has to be #VE is MMIO because
> we don't want to annotate every MMIO read*/write* with an alternative
> (which would result in incredible binary bloat) For the others they have
> mostly become now direct calls.
>
>
> >
> >> Decompression code uses port IO for earlyprintk. We must use
> >> paravirt calls there too if we want to allow earlyprintk.
> > What is the tradeoff between teaching the decompression code to handle
> > #VE (the implied assumption) vs teaching it to avoid #VE with direct
> > TDVMCALLs (the chosen direction)?
>
> The decompression code only really needs it to output something. But you
> couldn't debug anything until #VE is set up. Also the decompression code
> has a very basic environment that doesn't supply most kernel services,
> and the #VE handler is relatively complicated. It would probably need to
> be duplicated and the instruction decoder be ported to work in this
> environment. It would be all a lot of work, just to make the debug
> output work.
>
> >
> >> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> >> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> >> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> >> ---
> >>   arch/x86/boot/compressed/Makefile |   1 +
> >>   arch/x86/boot/compressed/tdcall.S |   9 ++
> >>   arch/x86/include/asm/io.h         |   5 +-
> >>   arch/x86/include/asm/tdx.h        |  46 ++++++++-
> >>   arch/x86/kernel/tdcall.S          | 154 ++++++++++++++++++++++++++++++
> > Why is this named "tdcall" when it is implementing tdvmcalls? I must
> > say those names don't really help me understand what they do. Can we
> > have Linux names that don't mandate keeping the spec terminology in my
> > brain's translation cache?
>
> The instruction is called TDCALL. It's always the same instruction
>
> TDVMCALL is the variant when the host processes it (as opposed to the
> TDX module), but it's just a different name space in the call number.
>
>

Ok.

>              \
>
> > Is there a unified Linux name these can be given to stop the
> > proliferation of poor vendor names for similar concepts?
>
> We could use protected_guest()

Looks good.

>
>
> >
> > Does it also not know how to handle #VE to keep it aligned with the
> > runtime code?
>
>
> Not sure I understand the question, but the decompression code supports
> neither alternatives nor #VE. It's a very limited environment.

Yes, that addresses the question.

>
> >
> > Outside the boot decompression code isn't this branch of the "ifdef
> > BOOT_COMPRESSED_MISC_H"  handled by #VE? I also don't see any usage of
> > __{in,out}() in this patch.
>
> I thought it was all alternative after decompression, so the #VE code
> shouldn't be called. We still have it for some reason though.

Right, I'm struggling to understand where these spurious in/out
instructions are coming from that are not replaced by the
alternative's code? Shouldn't those be dropped on the floor and warned
about rather than handled? I.e. shouldn't port-io instruction escapes
that would cause #VE be precluded at build-time?

>
>
> >
> > Perhaps "PAYLOAD_SIZE" since it is used for both input and output?
> >
> > If the ABI does not include the size of the payload then how would
> > code detect if even 80 bytes was violated in the future?
>
>
> The payload in memory is just a Linux concept. At the TDCALL level it's
> only registers.
>

If it's only a Linux concept why does this code need to "prepare for
the future"?


> >
> > 5
> > Surely there's an existing macro for this pattern? Would
> > PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
> > would eliminate clearing of %r8.
>
>
> There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
> some past refactorings.

Not a huge deal, but at a minimum it seems a generic construct that
deserves to be declared centrally rather than tdx-guest-port-io local.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-10 23:34       ` Dan Williams
@ 2021-05-11  0:01         ` Andi Kleen
  2021-05-11  0:21           ` Dan Williams
  2021-05-11  0:30         ` Kuppuswamy, Sathyanarayanan
  2021-05-11  0:56         ` Kuppuswamy, Sathyanarayanan
  2 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-11  0:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Mon, May 10, 2021 at 04:34:34PM -0700, Dan Williams wrote:
> > > Outside the boot decompression code isn't this branch of the "ifdef
> > > BOOT_COMPRESSED_MISC_H"  handled by #VE? I also don't see any usage of
> > > __{in,out}() in this patch.
> >
> > I thought it was all alternative after decompression, so the #VE code
> > shouldn't be called. We still have it for some reason though.
> 
> Right, I'm struggling to understand where these spurious in/out
> instructions are coming from that are not replaced by the
> alternative's code?

There should be nothing in the main tree at least.

> Shouldn't those be dropped on the floor and warned
> about rather than handled? 

It might be related to eventually handling them in ring 3, but
I believe we disallow that currently too and it's not all that useful
anyways.  So yes it could be forbidden.

> I.e. shouldn't port-io instruction escapes
> that would cause #VE be precluded at build-time?

You mean in objtool? That would seem like overkill for a more theoretical
problem.

> > There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
> > some past refactorings.
> 
> Not a huge deal, but at a minimum it seems a generic construct that
> deserves to be declared centrally rather than tdx-guest-port-io local.

Yes I agree. We should just bring SAVE_ALL/SAVE_REGS back.

-Andi

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11  0:01         ` Andi Kleen
@ 2021-05-11  0:21           ` Dan Williams
  0 siblings, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-11  0:21 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Mon, May 10, 2021 at 5:01 PM Andi Kleen <ak@lin
[..]
> > I.e. shouldn't port-io instruction escapes
> > that would cause #VE be precluded at build-time?
>
> You mean in objtool? That would seem like overkill for a more theoretical
> problem.

Oh, sorry, no, I was not implying objtool overkill, just that the
mainline kernel should not be surprised by spurious instruction usage.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-10 23:34       ` Dan Williams
  2021-05-11  0:01         ` Andi Kleen
@ 2021-05-11  0:30         ` Kuppuswamy, Sathyanarayanan
  2021-05-11  1:07           ` Dan Williams
  2021-05-11  0:56         ` Kuppuswamy, Sathyanarayanan
  2 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-11  0:30 UTC (permalink / raw)
  To: Dan Williams, Andi Kleen
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List



On 5/10/21 4:34 PM, Dan Williams wrote:
> On Mon, May 10, 2021 at 4:08 PM Andi Kleen <ak@linux.intel.com> wrote:
>>
>>
>> On 5/10/2021 2:57 PM, Dan Williams wrote:
>>>
>>> There is a mix of direct-TDVMCALL usage and handling #VE when and why
>>> is either approached used?
>>
>> For the really early code in the decompressor or the main kernel we
>> can't use #VE because the IDT needed for handling the exception is not
>> set up, and some other infrastructure needed by the handler is missing.
>> The early code needs to do port IO to be able to write the early serial
>> console. To keep it all common it ended up that all port IO is paravirt.
>> Actually for most the main kernel port IO calls we could just use #VE
>> and it would result in smaller binaries, but then we would need to
>> annotate all early portio with some special name. That's why port IO is
>> all TDCALL.
> 
> Thanks Andi. Sathya, please include the above in the next posting.

Will include it.

> 
>>
>> For some others the only thing that really has to be #VE is MMIO because
>> we don't want to annotate every MMIO read*/write* with an alternative
>> (which would result in incredible binary bloat) For the others they have
>> mostly become now direct calls.
>>
>>
>>>
>>>> Decompression code uses port IO for earlyprintk. We must use
>>>> paravirt calls there too if we want to allow earlyprintk.
>>> What is the tradeoff between teaching the decompression code to handle
>>> #VE (the implied assumption) vs teaching it to avoid #VE with direct
>>> TDVMCALLs (the chosen direction)?
>>
>> The decompression code only really needs it to output something. But you
>> couldn't debug anything until #VE is set up. Also the decompression code
>> has a very basic environment that doesn't supply most kernel services,
>> and the #VE handler is relatively complicated. It would probably need to
>> be duplicated and the instruction decoder be ported to work in this
>> environment. It would be all a lot of work, just to make the debug
>> output work.
>>
>>>
>>>> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>>>> ---
>>>>    arch/x86/boot/compressed/Makefile |   1 +
>>>>    arch/x86/boot/compressed/tdcall.S |   9 ++
>>>>    arch/x86/include/asm/io.h         |   5 +-
>>>>    arch/x86/include/asm/tdx.h        |  46 ++++++++-
>>>>    arch/x86/kernel/tdcall.S          | 154 ++++++++++++++++++++++++++++++
>>> Why is this named "tdcall" when it is implementing tdvmcalls? I must
>>> say those names don't really help me understand what they do. Can we
>>> have Linux names that don't mandate keeping the spec terminology in my
>>> brain's translation cache?
>>
>> The instruction is called TDCALL. It's always the same instruction
>>
>> TDVMCALL is the variant when the host processes it (as opposed to the
>> TDX module), but it's just a different name space in the call number.
>>
>>
> 
> Ok.
> 
>>               \
>>
>>> Is there a unified Linux name these can be given to stop the
>>> proliferation of poor vendor names for similar concepts?
>>
>> We could use protected_guest()
> 
> Looks good.
> 
>>
>>
>>>
>>> Does it also not know how to handle #VE to keep it aligned with the
>>> runtime code?
>>
>>
>> Not sure I understand the question, but the decompression code supports
>> neither alternatives nor #VE. It's a very limited environment.
> 
> Yes, that addresses the question.
> 
>>
>>>
>>> Outside the boot decompression code isn't this branch of the "ifdef
>>> BOOT_COMPRESSED_MISC_H"  handled by #VE? I also don't see any usage of
>>> __{in,out}() in this patch.
>>
>> I thought it was all alternative after decompression, so the #VE code
>> shouldn't be called. We still have it for some reason though.
> 
> Right, I'm struggling to understand where these spurious in/out
> instructions are coming from that are not replaced by the
> alternative's code? Shouldn't those be dropped on the floor and warned
> about rather than handled? I.e. shouldn't port-io instruction escapes
> that would cause #VE be precluded at build-time?
> 
>>
>>
>>>
>>> Perhaps "PAYLOAD_SIZE" since it is used for both input and output?
>>>
>>> If the ABI does not include the size of the payload then how would
>>> code detect if even 80 bytes was violated in the future?
>>
>>
>> The payload in memory is just a Linux concept. At the TDCALL level it's
>> only registers.
>>
> 
> If it's only a Linux concept why does this code need to "prepare for
> the future"?

It is a software-only structure, created to group all the output
registers used by the VMM. You can find more details about it in the patch titled
"[RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions".

It is mainly used by functions like __tdx_hypercall(), __tdx_hypercall_vendor_kvm()
and tdx_in{b,w,l}.

u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
                     struct tdx_hypercall_output *out);
u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
                                u64 r15, struct tdx_hypercall_output *out);

struct tdx_hypercall_output {
         u64 r11;
         u64 r12;
         u64 r13;
         u64 r14;
         u64 r15;
};


Functions like __tdx_hypercall() and __tdx_hypercall_vendor_kvm() are used
by the TDX guest to request services (like RDMSR, WRMSR, GetQuote, etc.) from
the VMM using the TDCALL instruction. do_tdx_hypercall() is the helper function
(in tdcall.S) which actually implements this ABI.

As per the current ABI, the VMM uses registers R11-R15 to share the output
values with the guest. So we have defined
struct tdx_hypercall_output to group all output registers and make it easier
to share them with users of the TDCALLs. This is a Linux-defined structure.

If there are any changes in the TDCALL ABI for the VMM, we might have to extend
this structure to accommodate new output registers. So if we
define TDVMCALL_OUTPUT_SIZE as 40, we will have to modify this value for
any future struct tdx_hypercall_output changes. To avoid that, we have
allocated double the size.

Maybe I should define it as:

#define TDVMCALL_OUTPUT_SIZE            sizeof(struct tdx_hypercall_output)

But currently we don't include asm/tdx.h (which defines
struct tdx_hypercall_output) in tdcall.S, so I have defined the size as
a constant value.
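
As a sanity check of the numbers only, the standalone user-space sketch
below confirms that the structure is exactly 40 bytes with r11 at offset
0. The tdx_output_offset() helper is a made-up name for illustration. In
the kernel itself, values like this are normally handed to assembly
through asm-offsets.c rather than through a C header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;

/* Same layout as the structure quoted above. */
struct tdx_hypercall_output {
	u64 r11;
	u64 r12;
	u64 r13;
	u64 r14;
	u64 r15;
};

#define TDVMCALL_OUTPUT_SIZE	sizeof(struct tdx_hypercall_output)

/* Five 64-bit registers: 5 * 8 = 40 bytes, no padding. */
_Static_assert(TDVMCALL_OUTPUT_SIZE == 40, "five 64-bit registers");
_Static_assert(offsetof(struct tdx_hypercall_output, r11) == 0,
	       "r11 comes first");

/* Byte offset of the n-th output register (0 = r11 ... 4 = r15). */
static inline size_t tdx_output_offset(unsigned int n)
{
	assert(n < 5);
	return n * sizeof(u64);
}
```

Deriving the size from the structure means it follows any future change
to struct tdx_hypercall_output automatically, so no spare bytes need to
be reserved.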

> 
> 
>>>
>>> 5
>>> Surely there's an existing macro for this pattern? Would
>>> PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
>>> would eliminate clearing of %r8.
>>
>>
>> There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
>> some past refactorings.
> 
> Not a huge deal, but at a minimum it seems a generic construct that
> deserves to be declared centrally rather than tdx-guest-port-io local.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-10 23:34       ` Dan Williams
  2021-05-11  0:01         ` Andi Kleen
  2021-05-11  0:30         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-11  0:56         ` Kuppuswamy, Sathyanarayanan
  2021-05-11  2:19           ` Andi Kleen
  2 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-11  0:56 UTC (permalink / raw)
  To: Dan Williams, Andi Kleen
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List



On 5/10/21 4:34 PM, Dan Williams wrote:
>>> Surely there's an existing macro for this pattern? Would
>>> PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
>>> would eliminate clearing of %r8.
>>
>> There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
>> some past refactorings.
> Not a huge deal, but at a minimum it seems a generic construct that
> deserves to be declared centrally rather than tdx-guest-port-io local.

I can define SAVE_ALL_REGS/RESTORE_ALL_REGS. Do you want me to move them
outside the TDX code? I don't know whether there will be other users.



-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11  0:30         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-11  1:07           ` Dan Williams
  2021-05-11  2:29             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-11  1:07 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Andi Kleen, Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Mon, May 10, 2021 at 5:30 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
[..]
> It is mainly used by functions like __tdx_hypercall(),__tdx_hypercall_vendor_kvm()
> and tdx_in{b,w,l}.
>
> u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
>                      struct tdx_hypercall_output *out);
> u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
>                                 u64 r15, struct tdx_hypercall_output *out);
>
> struct tdx_hypercall_output {
>          u64 r11;
>          u64 r12;
>          u64 r13;
>          u64 r14;
>          u64 r15;
> };

Why is this by register name and not something like:

struct tdx_hypercall_payload {
  u64 data[5];
};

...because the code in this patch is reading the payload out of a
stack relative offset, not r11.
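
For reference, the two layouts occupy the same memory. The sketch below,
which assumes the five-register output structure quoted above, checks that
the named fields land on the same offsets as the array slots. The union and
the payload_slot() helper are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;

struct tdx_hypercall_output {
	u64 r11, r12, r13, r14, r15;
};

/* Hypothetical union giving both views of the same 40 bytes. */
union tdx_hypercall_payload {
	struct tdx_hypercall_output regs;
	u64 data[5];
};

/* The named fields sit exactly where the array slots are. */
static_assert(offsetof(struct tdx_hypercall_output, r11) == 0 * sizeof(u64),
	      "r11 is slot 0");
static_assert(offsetof(struct tdx_hypercall_output, r15) == 4 * sizeof(u64),
	      "r15 is slot 4");

/* Read slot idx (0 = r11 ... 4 = r15) through the array view. */
static inline u64 payload_slot(const union tdx_hypercall_payload *p,
			       unsigned int idx)
{
	assert(idx < 5);
	return p->data[idx];
}
```

So an offset-based read in assembly and a named-field read in C refer to
the same storage; the choice between them is about readability only.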

>
>
> Functions like __tdx_hypercall() and __tdx_hypercall_vendor_kvm() are used
> by TDX guest to request services (like RDMSR, WRMSR,GetQuote, etc) from VMM
> using TDCALL instruction. do_tdx_hypercall() is the helper function (in
> tdcall.S) which actually implements this ABI.
>
> As per current ABI, VMM will use registers R11-R15 to share the output
> values with the guest.

Which ABI, __tdx_hypercall_vendor_kvm()? The code is putting the
payload on the stack, so I'm not sure what ABI you are referring to?


> So we have defined the structure
> struct tdx_hypercall_output to group all output registers and make it easier
> to share it with users of the TDCALLs. This is Linux defined structure.
>
> If there are any changes in TDCALL ABI for VMM, we might have to extend
> this structure to accommodate new output register changes.  So if we
> define TDVMCALL_OUTPUT_SIZE as 40, we will have modify this value for
> any future struct tdx_hypercall_output changes. So to avoid it, we have
> allocated double the size.
>
> May be I should define it as,
>
> #define TDVMCALL_OUTPUT_SIZE            sizeof(struct tdx_hypercall_output)

An arrangement like that seems more reasonable than a seemingly
arbitrary number and an ominous warning about things that may happen
in the future.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-04-26 18:01 ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Kuppuswamy Sathyanarayanan
@ 2021-05-11  1:23   ` Dan Williams
  2021-05-11  2:17     ` Andi Kleen
  2021-05-11 14:08     ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Dave Hansen
  2021-05-11 15:53   ` Dave Hansen
  1 sibling, 2 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-11  1:23 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Mon, Apr 26, 2021 at 11:02 AM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> When running as a TDX guest, there are a number of existing,
> privileged instructions that do not work. If the guest kernel
> uses these instructions, the hardware generates a #VE.
>
> You can find the list of unsupported instructions in Intel
> Trust Domain Extensions (Intel® TDX) Module specification,
> sec 9.2.2 and in Guest-Host Communication Interface (GHCI)
> Specification for Intel TDX, sec 2.4.1.
>

Ah, better than the "handle port io" patch, these details at least
give the reader a chance.

> To prevent TD guest from using MWAIT/MONITOR instructions,
> support for these instructions are already disabled by TDX
> module (SEAM). So CPUID flags for these instructions should
> be in disabled state.

Why does this not result in a #UD if the instruction is disabled by
SEAM? How is it possible to execute a disabled instruction (one
precluded by CPUID) to the point where it triggers #VE instead of #UD?

> After the above mentioned preventive measures, if TD guests still
> execute these instructions, add appropriate warning messages in #VE
> handler. For WBINVD instruction, since it's related to memory writeback
> and cache flushes, it's mainly used in context of IO devices. Since
> TDX 1.0 does not support non-virtual I/O devices, skipping it should
> not cause any fatal issues.

WBINVD is in a different class than MWAIT/MONITOR since it is not
identified by CPUID, it can't possibly have the same #UD behaviour.
It's not clear why WBINVD is included in the same patch as
MWAIT/MONITOR?

I disagree with the assertion that WBINVD is mainly used in the
context of I/O devices, it's also used for ACPI power management
paths. WBINVD dependent functionality should be dynamically disabled
rather than warned about.

Does a TDX guest support out-of-tree modules?  The kernel is already
tainted when out-of-tree modules are loaded. In other words in-tree
modules preclude forbidden instructions because they can just be
audited, and out-of-tree modules are ok to trigger abrupt failure if
they attempt to use forbidden instructions.

> But to let users know about its usage, use
> WARN() to report about it.. For MWAIT/MONITOR instruction, since its
> unsupported use WARN() to report unsupported usage.

I'm not sure how useful warning is outside of a kernel developer's
debug environment. The kernel should know what instructions are
disabled and which are available. WBINVD in particular has potential
data integrity implications. Code that might lead to a WBINVD usage
should be disabled, not run all the way up to where WBINVD is
attempted and then trigger an after-the-fact WARN_ONCE().

The WBINVD change deserves to be split off from MWAIT/MONITOR, and
more thought needs to be put into where these spurious instruction
usages are arising.

>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/kernel/tdx.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 3fe617978fc4..294dda5bf3f6 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -371,6 +371,21 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>         case EXIT_REASON_EPT_VIOLATION:
>                 ve->instr_len = tdg_handle_mmio(regs, ve);
>                 break;
> +       case EXIT_REASON_WBINVD:
> +               /*
> +                * WBINVD is not supported inside TDX guests. All in-
> +                * kernel uses should have been disabled.
> +                */
> +               WARN_ONCE(1, "TD Guest used unsupported WBINVD instruction\n");
> +               break;
> +       case EXIT_REASON_MONITOR_INSTRUCTION:
> +       case EXIT_REASON_MWAIT_INSTRUCTION:
> +               /*
> +                * Something in the kernel used MONITOR or MWAIT despite
> +                * X86_FEATURE_MWAIT being cleared for TDX guests.
> +                */
> +               WARN_ONCE(1, "TD Guest used unsupported MWAIT/MONITOR instruction\n");
> +               break;
>         default:
>                 pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>                 return -EFAULT;
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11  1:23   ` Dan Williams
@ 2021-05-11  2:17     ` Andi Kleen
  2021-05-11  2:44       ` Kuppuswamy, Sathyanarayanan
  2021-05-11 15:37       ` Dan Williams
  2021-05-11 14:08     ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Dave Hansen
  1 sibling, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-11  2:17 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

>> To prevent TD guest from using MWAIT/MONITOR instructions,
>> support for these instructions are already disabled by TDX
>> module (SEAM). So CPUID flags for these instructions should
>> be in disabled state.
> Why does this not result in a #UD if the instruction is disabled by
> SEAM?

It's just the TDX module (SEAM is the execution mode used by the TDX module)


> How is it possible to execute a disabled instruction (one
> precluded by CPUID) to the point where it triggers #VE instead of #UD?

That's how the TDX module works. It never injects anything other than
#VE. You can still get other exceptions of course, but they won't come
from the TDX module.

>> After the above mentioned preventive measures, if TD guests still
>> execute these instructions, add appropriate warning messages in #VE
>> handler. For WBINVD instruction, since it's related to memory writeback
>> and cache flushes, it's mainly used in context of IO devices. Since
>> TDX 1.0 does not support non-virtual I/O devices, skipping it should
>> not cause any fatal issues.
> WBINVD is in a different class than MWAIT/MONITOR since it is not
> identified by CPUID, it can't possibly have the same #UD behaviour.
> It's not clear why WBINVD is included in the same patch as
> MWAIT/MONITOR?

Because these are all instructions we never expect to execute, so 
nothing special is needed for them. That's a unique class that logically 
fits together.


>
> I disagree with the assertion that WBINVD is mainly used in the
> context of I/O devices, it's also used for ACPI power management
> paths.

You mean S3? That's of course also not supported inside TDX.


>   WBINVD dependent functionality should be dynamically disabled
> rather than warned about.
>
> Does a TDX guest support out-of-tree modules?  The kernel is already
> tainted when out-of-tree modules are loaded. In other words in-tree
> modules preclude forbidden instructions because they can just be
> audited, and out-of-tree modules are ok to trigger abrupt failure if
> they attempt to use forbidden instructions.

We already did a lot of bi^wdiscussion on this on the last review.

Originally we had a different handling, this was the result of previous 
feedback.

It doesn't really matter because it should never happen.


>
>> But to let users know about its usage, use
>> WARN() to report about it.. For MWAIT/MONITOR instruction, since its
>> unsupported use WARN() to report unsupported usage.
> I'm not sure how useful warning is outside of a kernel developer's
> debug environment. The kernel should know what instructions are
> disabled and which are available. WBINVD in particular has potential
> data integrity implications. Code that might lead to a WBINVD usage
> should be disabled, not run all the way up to where WBINVD is
> attempted and then trigger an after-the-fact WARN_ONCE().

We don't expect the warning to ever happen. Yes all of this will be 
disabled. Nearly all are in code paths that cannot happen inside TDX 
anyways due to missing PCI-IDs or different cpuids, and S3 is explicitly 
disabled and would be impossible anyways due to lack of BIOS support.




>
> The WBINVD change deserves to be split off from MWAIT/MONITOR, and
> more thought needs to be put into where these spurious instruction
> usages are arising.

I disagree. We already spent a lot of cycles on this. WBINVD never makes
sense in current TDX, and all the code will be disabled.


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11  0:56         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-11  2:19           ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-11  2:19 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List


On 5/10/2021 5:56 PM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 5/10/21 4:34 PM, Dan Williams wrote:
>>>> Surely there's an existing macro for this pattern? Would
>>>> PUSH_AND_CLEAR_REGS + POP_REGS be suitable? Besides code sharing it
>>>> would eliminate clearing of %r8.
>>>
>>> There used to be SAVE_ALL/SAVE_REGS, but they have been all removed in
>>> some past refactorings.
>> Not a huge deal, but at a minimum it seems a generic construct that
>> deserves to be declared centrally rather than tdx-guest-port-io local.
>
> I can define SAVE_ALL_REGS/RESTORE_ALL_REGS. Do you want to move it 
> outside
> TDX code? I don't know if there will be other users for it?

The old name was SAVE_ALL / SAVE_REGS.

Yes please put it outside tdx code into some include file.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11  1:07           ` Dan Williams
@ 2021-05-11  2:29             ` Kuppuswamy, Sathyanarayanan
  2021-05-11 14:39               ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-11  2:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andi Kleen, Peter Zijlstra, Andy Lutomirski, Dave Hansen,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List



On 5/10/21 6:07 PM, Dan Williams wrote:
> On Mon, May 10, 2021 at 5:30 PM Kuppuswamy, Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> [..]
>> It is mainly used by functions like __tdx_hypercall(),__tdx_hypercall_vendor_kvm()
>> and tdx_in{b,w,l}.
>>
>> u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
>>                       struct tdx_hypercall_output *out);
>> u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
>>                                  u64 r15, struct tdx_hypercall_output *out);
>>
>> struct tdx_hypercall_output {
>>           u64 r11;
>>           u64 r12;
>>           u64 r13;
>>           u64 r14;
>>           u64 r15;
>> };
> 
> Why is this by register name and not something like:
> 
> struct tdx_hypercall_payload {
>    u64 data[5];
> };
> 
> ...because the code in this patch is reading the payload out of a
> stack relative offset, not r11.

Since this patch allocates this memory in ASM code, we read it via
offset. If you see other use cases in tdx.c, you will notice the use
of register names.

static void tdg_handle_cpuid(struct pt_regs *regs)
{
         u64 ret;
         struct tdx_hypercall_output out = {0};

         ret = __tdx_hypercall(EXIT_REASON_CPUID, regs->ax,
                               regs->cx, 0, 0, &out);

         WARN_ON(ret);

         regs->ax = out.r12;
         regs->bx = out.r13;
         regs->cx = out.r14;
         regs->dx = out.r15;
}

static u64 tdg_read_msr_safe(unsigned int msr, int *err)
{
         u64 ret;
         struct tdx_hypercall_output out = {0};

         WARN_ON_ONCE(tdg_is_context_switched_msr(msr));

         /*
          * Intel CPUs do not use the CSTAR MSR for the SYSCALL
          * instruction, so just ignore it. Even raising a TDVMCALL
          * would lead to the same result.
          */
         if (msr == MSR_CSTAR) {
                 /* Report success so the caller never reads a stale *err */
                 *err = 0;
                 return 0;
         }

         ret = __tdx_hypercall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out);

         *err = ret ? -EIO : 0;

         return out.r11;
}


> 
>>
>>
>> Functions like __tdx_hypercall() and __tdx_hypercall_vendor_kvm() are used
>> by TDX guest to request services (like RDMSR, WRMSR,GetQuote, etc) from VMM
>> using TDCALL instruction. do_tdx_hypercall() is the helper function (in
>> tdcall.S) which actually implements this ABI.
>>
>> As per current ABI, VMM will use registers R11-R15 to share the output
>> values with the guest.
> 
> Which ABI,

TDCALL ABI (see sections 3.1 to 3.12 and look for Output Operands in each TDVMCALL variant).

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

>   __tdx_hypercall_vendor_kvm()? The code is putting the
> payload on the stack, so I'm not sure what ABI you are referring to?
> 
> 
>> So we have defined the structure
>> struct tdx_hypercall_output to group all output registers and make it easier
>> to share it with users of the TDCALLs. This is Linux defined structure.
>>
>> If there are any changes in TDCALL ABI for VMM, we might have to extend
>> this structure to accommodate new output register changes.  So if we
>> define TDVMCALL_OUTPUT_SIZE as 40, we will have modify this value for
>> any future struct tdx_hypercall_output changes. So to avoid it, we have
>> allocated double the size.
>>
>> May be I should define it as,
>>
>> #define TDVMCALL_OUTPUT_SIZE            sizeof(struct tdx_hypercall_output)
> 
> An arrangement like that seems more reasonable than a seemingly
> arbitrary number and an ominous warning about things that may happen
> in the future.

I will use the above format.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11  2:17     ` Andi Kleen
@ 2021-05-11  2:44       ` Kuppuswamy, Sathyanarayanan
  2021-05-11  2:51         ` Andi Kleen
  2021-05-11 15:37       ` Dan Williams
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-11  2:44 UTC (permalink / raw)
  To: Andi Kleen, Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List



On 5/10/21 7:17 PM, Andi Kleen wrote:
>>> To prevent TD guest from using MWAIT/MONITOR instructions,
>>> support for these instructions are already disabled by TDX
>>> module (SEAM). So CPUID flags for these instructions should
>>> be in disabled state.
>> Why does this not result in a #UD if the instruction is disabled by
>> SEAM?
> 
> It's just the TDX module (SEAM is the execution mode used by the TDX module)

If it is disabled by the TDX module, we should never execute it. But if for
some reason we still come across this instruction (buggy TDX module?), we
print an appropriate warning in the #VE handler.

> 
> 
>> How is it possible to execute a disabled instruction (one
>> precluded by CPUID) to the point where it triggers #VE instead of #UD?
> 
> That's how the TDX module works. It never injects anything else other than #VE. You can still get 
> other exceptions of course, but they won't come from the TDX module.
> 
>>> After the above mentioned preventive measures, if TD guests still
>>> execute these instructions, add appropriate warning messages in #VE
>>> handler. For WBINVD instruction, since it's related to memory writeback
>>> and cache flushes, it's mainly used in context of IO devices. Since
>>> TDX 1.0 does not support non-virtual I/O devices, skipping it should
>>> not cause any fatal issues.
>> WBINVD is in a different class than MWAIT/MONITOR since it is not
>> identified by CPUID, it can't possibly have the same #UD behaviour.
>> It's not clear why WBINVD is included in the same patch as
>> MWAIT/MONITOR?
> 
> Because these are all instructions we never expect to execute, so nothing special is needed for 
> them. That's a unique class that logically fits together.

Yes, none of these three instructions needs any special handling code,
so they are grouped together.

> 
> 
>>
>> I disagree with the assertion that WBINVD is mainly used in the
>> context of I/O devices, it's also used for ACPI power management
>> paths.
> 
> You mean S3? That's of course also not supported inside TDX.
> 
> 
>>   WBINVD dependent functionality should be dynamically disabled
>> rather than warned about.
>>
>> Does a TDX guest support out-of-tree modules?  The kernel is already
>> tainted when out-of-tree modules are loaded. In other words in-tree
>> modules preclude forbidden instructions because they can just be
>> audited, and out-of-tree modules are ok to trigger abrupt failure if
>> they attempt to use forbidden instructions.
> 
> We already did a lot of bi^wdiscussion on this on the last review.
> 
> Originally we had a different handling, this was the result of previous feedback.
> 
> It doesn't really matter because it should never happen.
> 
> 
>>
>>> But to let users know about its usage, use
>>> WARN() to report about it.. For MWAIT/MONITOR instruction, since its
>>> unsupported use WARN() to report unsupported usage.
>> I'm not sure how useful warning is outside of a kernel developer's
>> debug environment. The kernel should know what instructions are
>> disabled and which are available. WBINVD in particular has potential
>> data integrity implications. Code that might lead to a WBINVD usage
>> should be disabled, not run all the way up to where WBINVD is
>> attempted and then trigger an after-the-fact WARN_ONCE().
> 
> We don't expect the warning to ever happen. Yes all of this will be disabled. Nearly all are in code 
> paths that cannot happen inside TDX anyways due to missing PCI-IDs or different cpuids, and S3 is 
> explicitly disabled and would be impossible anyways due to lack of BIOS support.

We added the WARN so that users learn about any such usage and can fix it.
By default we should never hit this path.
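
A rough userspace sketch of the behaviour discussed here: warn the first
time an unexpected instruction traps, then carry on. The enum values,
warn_once(), warnings_printed and handle_ve() are stand-ins invented for
this example; the real handler uses WARN_ONCE() and the kernel's own
exit-reason definitions.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative values; the kernel takes these from its VMX definitions. */
enum ve_exit_reason {
	VE_MWAIT   = 36,
	VE_MONITOR = 39,
	VE_WBINVD  = 54,
};

/* Counts how many distinct warnings were actually printed. */
static unsigned int warnings_printed;

/* Userspace stand-in for WARN_ONCE(): each call site prints at most once. */
#define warn_once(msg)						\
	do {							\
		static bool warned;				\
		if (!warned) {					\
			warned = true;				\
			warnings_printed++;			\
			fprintf(stderr, "WARNING: %s\n", msg);	\
		}						\
	} while (0)

/* Returns 0 if the exit was handled (instruction skipped), -1 otherwise. */
static inline int handle_ve(int exit_reason)
{
	switch (exit_reason) {
	case VE_WBINVD:
		warn_once("TD guest used unsupported WBINVD instruction");
		return 0;
	case VE_MONITOR:
	case VE_MWAIT:
		warn_once("TD guest used unsupported MWAIT/MONITOR instruction");
		return 0;
	default:
		return -1;
	}
}
```

Repeated hits on the same case stay silent after the first report, which is
why this only helps whoever reads the log after the fact.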

> 
> 
> 
> 
>>
>> The WBINVD change deserves to be split off from MWAIT/MONITOR, and
>> more thought needs to be put into where these spurious instruction
>> usages are arising.
> 
> I disagree. We already spent a lot of cycles on this. WBINVD makes never sense in current TDX and 
> all the code will be disabled.

> 
> 
> -Andi
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11  2:44       ` Kuppuswamy, Sathyanarayanan
@ 2021-05-11  2:51         ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-11  2:51 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List


On 5/10/2021 7:44 PM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 5/10/21 7:17 PM, Andi Kleen wrote:
>>>> To prevent TD guest from using MWAIT/MONITOR instructions,
>>>> support for these instructions are already disabled by TDX
>>>> module (SEAM). So CPUID flags for these instructions should
>>>> be in disabled state.
>>> Why does this not result in a #UD if the instruction is disabled by
>>> SEAM?
>>
>> It's just the TDX module (SEAM is the execution mode used by the TDX 
>> module)
>
> If it is disabled by the TDX Module, we should never execute it. But 
> for some
> reason, if we still come across this instruction (buggy TDX module?), 
> we add
> appropriate warning in  #VE handler.

I think the only case where it could happen is if the kernel jumps to a 
random address due to a bug and the destination happens to be these 
instruction bytes. Of course it is exceedingly unlikely.

Or we make some mistake, but that's hopefully fixed quickly.


-Andi

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 28/32] x86/tdx: Make pages shared in ioremap()
  2021-05-10 22:52           ` Sean Christopherson
@ 2021-05-11  9:35             ` Borislav Petkov
  2021-05-20 20:12               ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Borislav Petkov @ 2021-05-11  9:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Kuppuswamy, Sathyanarayanan, Andi Kleen,
	Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel

On Mon, May 10, 2021 at 10:52:49PM +0000, Sean Christopherson wrote:
> I can't find the thread offhand, but Boris proposed something along the lines of
> cpu_has(), but specific to a given flavor of protected guest.  IIRC, it was
> sev_guest_has(SEV_ES) or something like that.
> 
> I 100% agree that we should have actual feature bits somewhere for the various
> protected guest flavors.

Preach brother! :)

/me goes and greps mailboxes...

ah, do you mean this, per chance:

https://lore.kernel.org/kvm/20210421144402.GB5004@zn.tnic/

?

And yes, this has "sev" in the name, and dhansen's wish to unify all the
protected guest feature queries under a common name makes sense to me.
Then, depending on the vendor, that common name will call the respective
vendor's helper to answer the protected guest aspect asked about.

This way, generic code will call

	protected_guest_has()

or so and be nicely abstracted away from the underlying implementation.
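
A small sketch of what such a common query could look like. Everything
here apart from the name protected_guest_has() is hypothetical: the feature
enum, the vendor enum, and the answers the two vendor helpers give are
placeholders, not statements about real SEV or TDX behaviour.

```c
#include <stdbool.h>

/* Hypothetical vendor and feature identifiers. */
enum guest_vendor { GUEST_NONE, GUEST_SEV, GUEST_TDX };

enum pg_feature {
	PG_MEM_ENCRYPT,		/* placeholder feature */
	PG_SHARED_IOREMAP,	/* placeholder feature */
};

/* In a real kernel this would be detected at boot; here it is a variable. */
static enum guest_vendor current_vendor = GUEST_NONE;

/* Vendor helpers with made-up answers. */
static bool sev_guest_has(enum pg_feature f)
{
	return f == PG_MEM_ENCRYPT;
}

static bool tdx_guest_has(enum pg_feature f)
{
	return f == PG_MEM_ENCRYPT || f == PG_SHARED_IOREMAP;
}

/* Generic code asks one question; the vendor helper answers it. */
static inline bool protected_guest_has(enum pg_feature f)
{
	switch (current_vendor) {
	case GUEST_SEV:
		return sev_guest_has(f);
	case GUEST_TDX:
		return tdx_guest_has(f);
	default:
		return false;
	}
}
```

Callers such as ioremap() would then test a feature rather than a vendor,
and never need to know which helper answered.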

Hohumm, yap, sounds nice to me.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11  1:23   ` Dan Williams
  2021-05-11  2:17     ` Andi Kleen
@ 2021-05-11 14:08     ` Dave Hansen
  2021-05-11 16:09       ` Sean Christopherson
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 14:08 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

On 5/10/21 6:23 PM, Dan Williams wrote:
>> To prevent TD guest from using MWAIT/MONITOR instructions,
>> support for these instructions are already disabled by TDX
>> module (SEAM). So CPUID flags for these instructions should
>> be in disabled state.
> Why does this not result in a #UD if the instruction is disabled by
> SEAM? How is it possible to execute a disabled instruction (one
> precluded by CPUID) to the point where it triggers #VE instead of #UD?

This is actually a vestige of VMX.  It's quite possible today to have a
feature which isn't enumerated in CPUID which still exists and "works"
in the silicon.  There are all kinds of pitfalls to doing this, but
folks evidently do it in public clouds all the time.

The CPUID virtualization basically just traps into the hypervisor and
lets the hypervisor set whatever register values it wants to appear when
CPUID "returns".

But, the controls for what instructions generate #UD are actually quite
separate and unrelated to CPUID itself.
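
To illustrate the enumeration half of that split: MONITOR/MWAIT is
enumerated by bit 3 of ECX in CPUID leaf 1, and whoever virtualizes CPUID
can simply clear that bit in the value it hands back. The sketch below works
on a plain integer and never executes CPUID; the two helper names are made
up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.01H:ECX bit 3 enumerates MONITOR/MWAIT. */
#define CPUID_1_ECX_MONITOR (1u << 3)

/* What a guest concludes from the ECX value it was shown. */
static inline bool cpuid_enumerates_mwait(uint32_t leaf1_ecx)
{
	return (leaf1_ecx & CPUID_1_ECX_MONITOR) != 0;
}

/* What a CPUID-virtualizing layer does to hide the feature. */
static inline uint32_t hide_mwait(uint32_t leaf1_ecx)
{
	return leaf1_ecx & ~CPUID_1_ECX_MONITOR;
}
```

Nothing in this path changes what happens when the instruction is actually
executed; that is decided by separate controls, which is how a hidden
instruction can still trap instead of raising #UD.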

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11  2:29             ` Kuppuswamy, Sathyanarayanan
@ 2021-05-11 14:39               ` Dave Hansen
  2021-05-11 15:08                 ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 14:39 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Dan Williams
  Cc: Andi Kleen, Peter Zijlstra, Andy Lutomirski, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

On 5/10/21 7:29 PM, Kuppuswamy, Sathyanarayanan wrote:
> On 5/10/21 6:07 PM, Dan Williams wrote:
>> On Mon, May 10, 2021 at 5:30 PM Kuppuswamy, Sathyanarayanan
>> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>> [..]
>>> It is mainly used by functions like
>>> __tdx_hypercall(),__tdx_hypercall_vendor_kvm()
>>> and tdx_in{b,w,l}.
>>>
>>> u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
>>>                       struct tdx_hypercall_output *out);
>>> u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
>>>                                  u64 r15, struct tdx_hypercall_output
>>> *out);
>>>
>>> struct tdx_hypercall_output {
>>>           u64 r11;
>>>           u64 r12;
>>>           u64 r13;
>>>           u64 r14;
>>>           u64 r15;
>>> };
>>
>> Why is this by register name and not something like:
>>
>> struct tdx_hypercall_payload {
>>    u64 data[5];
>> };
>>
>> ...because the code in this patch is reading the payload out of a
>> stack relative offset, not r11.
> 
> Since this patch allocates this memory in ASM code, we read it via
> offset. If you see other use cases in tdx.c, you will notice the use
> of register names.

To what do you refer by "this patch allocates this memory in ASM
code"?  Could you point to the specific ASM code that "allocates memory"?

Dan, I'll try to answer your question.  TDX has both a "hypercall"
interface for guests to call into hosts and a "seamcall" interface where
guests or hosts can talk to the TDX/SEAM module.

Both of these represent an ABI which _resembles_ a system call ABI.
Values are placed in registers, including a "function" register which is
very similar to the system call number we place in RAX.

*But* those ABIs were actually designed to (IIRC) resemble the
Windows/Microsoft ABI, not the Linux ABI.  So the register conventions
are unfamiliar.  There is assembly code to convert between the ELF
function call ABI and the TDX ABIs.

For instance, if you are in C code and you call:

	__tdx_hypercall_vendor_kvm(u64 fn, u64 r12, ...

The value for "fn" will be placed in RDI and "r12" will be placed in RSI
for the function call itself.  The assembly code will, for instance,
take the "r12" *VARIABLE* and ensure it gets into the R12 *REGISTER* for
the hypercall.

The same thing happens on the output side.  The TDX ABIs specify
"return" values in certain registers (r11-r15).  However, those
registers are not preserved in our function return ABI.  So, they must
be stashed off in memory into a place where the caller can retrieve them.
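
The marshalling described above can be modelled in plain C. This is a
userspace sketch only: struct mock_regs stands in for the CPU registers,
mock_vmm() stands in for the host, and every name other than
struct tdx_hypercall_output is invented for illustration. The use of r10
for the status and r11 for the function number is an assumption of the
mock, not something taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;

struct tdx_hypercall_output {
	u64 r11, r12, r13, r14, r15;
};

/* Stand-in for the registers the hypercall ABI uses. */
struct mock_regs {
	u64 r10, r11, r12, r13, r14, r15;
};

/* Pretend host: reports success in r10 and returns r12 + 1 in r11. */
static void mock_vmm(struct mock_regs *regs)
{
	regs->r11 = regs->r12 + 1;
	regs->r10 = 0;
}

/* Models the assembly glue: C arguments in, "registers" copied out to memory. */
static inline u64 mock_tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
				     u64 r15, struct tdx_hypercall_output *out)
{
	struct mock_regs regs = {
		.r10 = 0, .r11 = fn,
		.r12 = r12, .r13 = r13, .r14 = r14, .r15 = r15,
	};

	mock_vmm(&regs);

	/* r11-r15 do not survive a C function return, so stash them. */
	if (out != NULL) {
		out->r11 = regs.r11;
		out->r12 = regs.r12;
		out->r13 = regs.r13;
		out->r14 = regs.r14;
		out->r15 = regs.r15;
	}
	return regs.r10;
}
```

The caller gets the status as the ordinary return value and reads any
further results from the structure by register name.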

Rather than being unstructured "data[]", the value in
tdx_hypercall_output->r11 was actually in register R11 at some point.
If you look at the spec, you can see the functions that use R11.

Let's say there's a hypercall to check for whether puppies are cute.
Here's the kernel side:

bool tdx_hypercall_puppies_are_cute(void)
{
	struct tdx_hypercall_output out;
	u64 ret;

	ret = __tdx_hypercall_vendor_kvm(HOST_LIKES_PUPPIES, ..., &out);

	/* Did the hypercall even succeed? A bool cannot carry -EINVAL. */
	if (ret != SUCCESS)
		return false;

	/* 'out' is a struct, not a pointer; test the bit the spec names */
	if (out.r11 & TDX_WHATEVER_CUTE_BIT)
		return true;

	// Nope, I guess puppies are not cute
	return false;
}

The spec would actually say, "Blah blah, puppies are cute if
TDX_WHATEVER_CUTE_BIT is set in r11".  So, this whole setup actually
results in really nice C code that you can sit side-by-side with the
spec and see if they agree.
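
To make the shape concrete, here is a minimal user-space sketch of the
same pattern.  Everything in it is an assumption for illustration: the
struct layout (one slot each for r11-r15), the mock_tdx_hypercall()
function standing in for the real assembly wrapper, and the numeric
values of HOST_LIKES_PUPPIES and TDX_WHATEVER_CUTE_BIT are made up.
Only the idea that the output registers get stashed in a struct the
caller provides comes from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values; real ones would come from the spec. */
#define HOST_LIKES_PUPPIES    0x1234u
#define TDX_WHATEVER_CUTE_BIT (1ull << 0)
#define MOCK_SUCCESS          0u

/* Assumed layout: one slot per TDX "return" register (r11-r15). */
struct tdx_hypercall_output {
	uint64_t r11, r12, r13, r14, r15;
};

/*
 * Stand-in for the assembly wrapper.  A real wrapper would move the
 * arguments into the registers the TDX ABI wants, issue the call, and
 * copy r11-r15 into *out before returning to C.
 */
static uint64_t mock_tdx_hypercall(uint64_t fn,
				   struct tdx_hypercall_output *out)
{
	assert(out != NULL);
	*out = (struct tdx_hypercall_output){ 0 };

	if (fn != HOST_LIKES_PUPPIES)
		return 1;	/* any non-zero value is a failure here */

	out->r11 = TDX_WHATEVER_CUTE_BIT;
	return MOCK_SUCCESS;
}

/* The C caller only ever looks at named fields, never raw registers. */
static bool puppies_are_cute(void)
{
	struct tdx_hypercall_output out;

	if (mock_tdx_hypercall(HOST_LIKES_PUPPIES, &out) != MOCK_SUCCESS)
		return false;

	return (out.r11 & TDX_WHATEVER_CUTE_BIT) != 0;
}
```

The caller reads out.r11 by name, which is what lets the C code be
compared line by line against a spec that talks about register R11.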

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11 14:39               ` Dave Hansen
@ 2021-05-11 15:08                 ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-11 15:08 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Andi Kleen, Peter Zijlstra, Andy Lutomirski, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List



On 5/11/21 7:39 AM, Dave Hansen wrote:
> To what do you refer by "this patch allocates this memory in ASM
> code"?  Could you point to the specific ASM code that "allocates memory"?

We use 40 bytes on the stack for storing the output register values. It is in
function tdg_inl().

subq $TDVMCALL_OUTPUT_SIZE, %rsp

+SYM_FUNC_START(tdg_inl)
+	io_save_registers
+	/* Set data width to 4 bytes */
+	mov $4, %rsi
+1:
+	mov %rdx, %rcx
+	/* Set 0 in RDX to select in operation */
+	mov $0, %rdx
+	/* Set TDVMCALL function id in RDI */
+	mov $EXIT_REASON_IO_INSTRUCTION, %rdi
+	/* Set TDVMCALL type info (0 - Standard, > 0 - vendor) in R10 */
+	xor %r10, %r10
+	/* Allocate memory in stack for Output */
+	subq $TDVMCALL_OUTPUT_SIZE, %rsp
+	/* Move tdvmcall_output pointer to R9 */
+	movq %rsp, %r9
+
+	call do_tdvmcall
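
In C terms, the subq/movq pair above behaves roughly like declaring a
local output struct and passing its address to the callee.  The sketch
below is only an analogy under assumed names: struct tdvmcall_output
and fake_do_tdvmcall() are hypothetical, the value written to r11 is
arbitrary, and the 40-byte figure is taken to be five 8-byte register
slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_OUTPUT_REGS       5	/* assumed: r11-r15 */
#define TDVMCALL_OUTPUT_SIZE (NR_OUTPUT_REGS * sizeof(uint64_t))

/* Hypothetical layout of the stack area reserved by the subq. */
struct tdvmcall_output {
	uint64_t r11, r12, r13, r14, r15;
};

_Static_assert(sizeof(struct tdvmcall_output) == TDVMCALL_OUTPUT_SIZE,
	       "output area must hold exactly the saved registers");

/* Stand-in for do_tdvmcall: fills the area the caller reserved. */
static void fake_do_tdvmcall(struct tdvmcall_output *out)
{
	out->r11 = 0x12345678;	/* arbitrary value "read" from the port */
}

/* A local variable plays the role of "subq $SIZE, %rsp; movq %rsp, %r9". */
static uint32_t fake_inl(void)
{
	struct tdvmcall_output out = { 0 };

	fake_do_tdvmcall(&out);
	return (uint32_t)out.r11;
}
```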

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-10 21:57   ` Dan Williams
  2021-05-10 23:08     ` Andi Kleen
@ 2021-05-11 15:35     ` Dave Hansen
  2021-05-11 15:43       ` Dan Williams
  2021-05-12  6:17       ` Dan Williams
  1 sibling, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 15:35 UTC (permalink / raw)
  To: Dan Williams, Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

On 5/10/21 2:57 PM, Dan Williams wrote:
>> Decompression code uses port IO for earlyprintk. We must use
>> paravirt calls there too if we want to allow earlyprintk.
> What is the tradeoff between teaching the decompression code to handle
> #VE (the implied assumption) vs teaching it to avoid #VE with direct
> TDVMCALLs (the chosen direction)?

To me, the tradeoff is not just "teaching" the code to handle a #VE, but
ensuring that the entire architecture works.

Intentionally invoking a #VE is like making a function call that *MIGHT*
recurse on itself.  Sure, you can try to come up with a story about
bounding the recursion.  But, I don't see any semblance of that in this
series.

Exception-based recursion is really nasty because it's implicit, not
explicit.  That's why I'm advocating for a design where the kernel never
intentionally causes a #VE: it never intentionally recurses without bounds.
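
A toy model of that concern, with entirely hypothetical names: suppose
the #VE handler's own debug path performs port I/O.  If port I/O is
implemented by trapping, the handler re-enters itself; if it is
implemented as a direct call, it does not.  The MAX_DEPTH guard exists
only so the model terminates; nothing like it is being claimed for the
series.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DEPTH 8	/* artificial guard so the model terminates */

static int ve_depth;	/* current nesting of the modeled handler */
static int ve_max_seen;	/* deepest nesting observed */

static void port_out(bool trap_based);

/* Modeled #VE handler: it logs, and logging itself does port I/O. */
static void ve_handler(bool trap_based)
{
	ve_depth++;
	if (ve_depth > ve_max_seen)
		ve_max_seen = ve_depth;

	if (ve_depth < MAX_DEPTH)
		port_out(trap_based);	/* earlyprintk-style debug write */

	ve_depth--;
}

/* Modeled port write: either traps (implicit call) or goes direct. */
static void port_out(bool trap_based)
{
	if (trap_based)
		ve_handler(trap_based);	/* implicit re-entry via exception */
	/* else: direct hypercall, no exception is taken */
}

/* Report how deep the handler nested for a single port write. */
static int max_nesting(bool trap_based)
{
	ve_depth = 0;
	ve_max_seen = 0;
	port_out(trap_based);
	return ve_max_seen;
}
```

In this model the direct path never enters the handler at all, while
the trapping path nests until the artificial guard stops it.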

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11  2:17     ` Andi Kleen
  2021-05-11  2:44       ` Kuppuswamy, Sathyanarayanan
@ 2021-05-11 15:37       ` Dan Williams
  2021-05-11 15:42         ` Andi Kleen
  2021-05-11 15:44         ` Dave Hansen
  1 sibling, 2 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-11 15:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Mon, May 10, 2021 at 7:17 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> >> To prevent TD guest from using MWAIT/MONITOR instructions,
> >> support for these instructions are already disabled by TDX
> >> module (SEAM). So CPUID flags for these instructions should
> >> be in disabled state.
> > Why does this not result in a #UD if the instruction is disabled by
> > SEAM?
>
> It's just the TDX module (SEAM is the execution mode used by the TDX module)
>
>
> > How is it possible to execute a disabled instruction (one
> > precluded by CPUID) to the point where it triggers #VE instead of #UD?
>
> That's how the TDX module works. It never injects anything else other
> than #VE. You can still get other exceptions of course, but they won't
> come from the TDX module.
>
> >> After the above mentioned preventive measures, if TD guests still
> >> execute these instructions, add appropriate warning messages in #VE
> >> handler. For WBIND instruction, since it's related to memory writeback
> >> and cache flushes, it's mainly used in context of IO devices. Since
> >> TDX 1.0 does not support non-virtual I/O devices, skipping it should
> >> not cause any fatal issues.
> > WBINVD is in a different class than MWAIT/MONITOR since it is not
> > identified by CPUID, it can't possibly have the same #UD behaviour.
> > It's not clear why WBINVD is included in the same patch as
> > MWAIT/MONITOR?
>
> Because these are all instructions we never expect to execute, so
> nothing special is needed for them. That's a unique class that logically
> fits together.
>
>
> >
> > I disagree with the assertion that WBINVD is mainly used in the
> > context of I/O devices, it's also used for ACPI power management
> > paths.
>
> You mean S3? That's of course also not supported inside TDX.
>
>
> >   WBINVD dependent functionality should be dynamically disabled
> > rather than warned about.
> >
> > Does a TDX guest support out-of-tree modules?  The kernel is already
> > tainted when out-of-tree modules are loaded. In other words in-tree
> > modules preclude forbidden instructions because they can just be
> > audited, and out-of-tree modules are ok to trigger abrupt failure if
> > they attempt to use forbidden instructions.
>
> We already did a lot of bi^wdiscussion on this on the last review.
>
> Originally we had a different handling, this was the result of previous
> feedback.
>
> It doesn't really matter because it should never happen.
>
>
> >
> >> But to let users know about its usage, use
> >> WARN() to report about it. For MWAIT/MONITOR instructions, since they are
> >> unsupported, use WARN() to report unsupported usage.
> > I'm not sure how useful warning is outside of a kernel developer's
> > debug environment. The kernel should know what instructions are
> > disabled and which are available. WBINVD in particular has potential
> > data integrity implications. Code that might lead to a WBINVD usage
> > should be disabled, not run all the way up to where WBINVD is
> > attempted and then trigger an after-the-fact WARN_ONCE().
>
> We don't expect the warning to ever happen. Yes all of this will be
> disabled. Nearly all are in code paths that cannot happen inside TDX
> anyways due to missing PCI-IDs or different cpuids, and S3 is explicitly
> disabled and would be impossible anyways due to lack of BIOS support.
>
>
>
>
> >
> > The WBINVD change deserves to be split off from MWAIT/MONITOR, and
> > more thought needs to be put into where these spurious instruction
> > usages are arising.
>
> I disagree. We already spent a lot of cycles on this. WBINVD never makes
> sense in current TDX and all the code will be disabled.

Why not just drop the patch if it continues to cause people to spend
cycles on it and it addresses a problem that will never happen?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 15:37       ` Dan Williams
@ 2021-05-11 15:42         ` Andi Kleen
  2021-05-11 15:44         ` Dave Hansen
  1 sibling, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-11 15:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dave Hansen, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List


On 5/11/2021 8:37 AM, Dan Williams wrote:
> Why not just drop the patch if it continues to cause people to spend
> cycles on it and it addresses a problem that will never happen?

We want to at least get some kind of warning if there is really a 
mistake. Just dropping such an ability wouldn't seem right.

That's all that the patch does really.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11 15:35     ` Dave Hansen
@ 2021-05-11 15:43       ` Dan Williams
  2021-05-12  6:17       ` Dan Williams
  1 sibling, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-11 15:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Tue, May 11, 2021 at 8:36 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/10/21 2:57 PM, Dan Williams wrote:
> >> Decompression code uses port IO for earlyprintk. We must use
> >> paravirt calls there too if we want to allow earlyprintk.
> > What is the tradeoff between teaching the decompression code to handle
> > #VE (the implied assumption) vs teaching it to avoid #VE with direct
> > TDVMCALLs (the chosen direction)?
>
> To me, the tradeoff is not just "teaching" the code to handle a #VE, but
> ensuring that the entire architecture works.
>
> Intentionally invoking a #VE is like making a function call that *MIGHT*
> recurse on itself.  Sure, you can try to come up with a story about
> bounding the recursion.  But, I don't see any semblance of that in this
> series.
>
> Exception-based recursion is really nasty because it's implicit, not
> explicit.  That's why I'm advocating for a design where the kernel never
> intentionally causes a #VE: it never intentionally recurses without bounds.

Thanks Dave, this really helps.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 15:37       ` Dan Williams
  2021-05-11 15:42         ` Andi Kleen
@ 2021-05-11 15:44         ` Dave Hansen
  2021-05-11 15:50           ` Dan Williams
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 15:44 UTC (permalink / raw)
  To: Dan Williams, Andi Kleen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On 5/11/21 8:37 AM, Dan Williams wrote:
>> I disagree. We already spent a lot of cycles on this. WBINVD never makes
>> sense in current TDX and all the code will be disabled.
> Why not just drop the patch if it continues to cause people to spend
> cycles on it and it addresses a problem that will never happen?

If someone calls WBINVD, we have a bug.  Not a little bug, either.  It
probably means there's some horribly confused kernel code that's now
facing broken cache coherency.  To me, it's a textbook place to use
BUG_ON().

This also doesn't "address" the problem, it just helps produce a more
coherent warning message.  It's why we have OOPS messages in the page
fault handler: it never makes any sense to dereference a NULL pointer,
yet we have code to make debugging them easier.  It's well worth the ~20
lines of code that this costs us for ease of debugging.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 15:44         ` Dave Hansen
@ 2021-05-11 15:50           ` Dan Williams
  2021-05-11 15:52             ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-11 15:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Tue, May 11, 2021 at 8:45 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/11/21 8:37 AM, Dan Williams wrote:
> >> I disagree. We already spent a lot of cycles on this. WBINVD never makes
> >> sense in current TDX and all the code will be disabled.
> > Why not just drop the patch if it continues to cause people to spend
> > cycles on it and it addresses a problem that will never happen?
>
> If someone calls WBINVD, we have a bug.  Not a little bug, either.  It
> probably means there's some horribly confused kernel code that's now
> facing broken cache coherency.  To me, it's a textbook place to use
> BUG_ON().
>
> This also doesn't "address" the problem, it just helps produce a more
> coherent warning message.  It's why we have OOPS messages in the page
> fault handler: it never makes any sense to dereference a NULL pointer,
> yet we have code to make debugging them easier.  It's well worth the ~20
> lines of code that this costs us for ease of debugging.

The 'default' case in this 'switch' prints the exit reason and faults,
can't that also trigger a backtrace that dumps the exception stack and
the faulting instruction? In other words shouldn't this just fail with
a common way to provide better debug on any unhandled #VE and not try
to continue running past something that "can't" happen?
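
As a rough sketch of the two policies being compared, the hypothetical
classifier below gives the "can't happen" instructions their own named
cases while everything else falls to a common unhandled path.  The
exit-reason numbers are believed to match the VMX definitions in
arch/x86/include/uapi/asm/vmx.h; the enum, struct, function name and
message strings are made up for illustration and are not the code in
the patch.

```c
#include <assert.h>
#include <stddef.h>

#define EXIT_REASON_IO_INSTRUCTION      30
#define EXIT_REASON_MWAIT_INSTRUCTION   36
#define EXIT_REASON_MONITOR_INSTRUCTION 39
#define EXIT_REASON_WBINVD              54

enum ve_action { VE_HANDLED, VE_FATAL };

struct ve_verdict {
	enum ve_action action;
	const char *why;	/* text that would lead the failure report */
};

/* Hypothetical policy: named cases differ from default only in 'why'. */
static struct ve_verdict classify_ve(int exit_reason)
{
	switch (exit_reason) {
	case EXIT_REASON_IO_INSTRUCTION:
		return (struct ve_verdict){ VE_HANDLED,
			"port I/O emulated via hypercall" };
	case EXIT_REASON_MWAIT_INSTRUCTION:
	case EXIT_REASON_MONITOR_INSTRUCTION:
		return (struct ve_verdict){ VE_FATAL,
			"MWAIT/MONITOR executed in TD guest" };
	case EXIT_REASON_WBINVD:
		return (struct ve_verdict){ VE_FATAL,
			"WBINVD executed in TD guest" };
	default:
		return (struct ve_verdict){ VE_FATAL, "unhandled #VE" };
	}
}
```

Written this way, the named fatal cases and the default case take the
same action and differ only in the message text, which is the crux of
the question being asked here.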

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 15:50           ` Dan Williams
@ 2021-05-11 15:52             ` Andi Kleen
  2021-05-11 16:04               ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-11 15:52 UTC (permalink / raw)
  To: Dan Williams, Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List


> The 'default' case in this 'switch' prints the exit reason and faults,
> can't that also trigger a backtrace that dumps the exception stack and
> the faulting instruction? In other words shouldn't this just fail with
> a common way to provide better debug on any unhandled #VE and not try
> to continue running past something that "can't" happen?

It will use the #GP common code which will do all the backtracing etc.

We didn't think we would need anything else than what #GP already does.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-04-26 18:01 ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Kuppuswamy Sathyanarayanan
  2021-05-11  1:23   ` Dan Williams
@ 2021-05-11 15:53   ` Dave Hansen
  1 sibling, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 15:53 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> For WBIND instruction, since it's related to memory writeback

	^ WBINVD

> and cache flushes, it's mainly used in context of IO devices. Since
> TDX 1.0 does not support non-virtual I/O devices, skipping it should
> not cause any fatal issues. But
Do me a favor:

	grep -ri wbinvd arch/x86/

How many I/O devices do you see?

Please get your ducks in a row here.  Come up with a coherent changelog
about why the arch/x86 use of WBINVD doesn't apply to TDX guests.
Explain the audit that you did.  You *DID* do an audit, right?


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 15:52             ` Andi Kleen
@ 2021-05-11 16:04               ` Dave Hansen
  2021-05-11 17:06                 ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 16:04 UTC (permalink / raw)
  To: Andi Kleen, Dan Williams
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On 5/11/21 8:52 AM, Andi Kleen wrote:
>> The 'default' case in this 'switch' prints the exit reason and faults,
>> can't that also trigger a backtrace that dumps the exception stack and
>> the faulting instruction? In other words shouldn't this just fail with
>> a common way to provide better debug on any unhandled #VE and not try
>> to continue running past something that "can't" happen?
> 
> It will use the #GP common code which will do all the backtracing etc.
> 
> We didn't think we would need anything else than what #GP already does.

How do these end up in practice?  Do they still say "general protection
fault..."?

Isn't that really mean for anyone that goes trying to figure out what
caused these?  If they see a "general protection fault" from WBINVD and
go digging in the SDM for how a #GP can come from WBINVD, won't they be
sorely disappointed?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 14:08     ` [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD Dave Hansen
@ 2021-05-11 16:09       ` Sean Christopherson
  2021-05-11 16:16         ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-11 16:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dan Williams, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Linux Kernel Mailing List

On Tue, May 11, 2021, Dave Hansen wrote:
> On 5/10/21 6:23 PM, Dan Williams wrote:
> >> To prevent TD guest from using MWAIT/MONITOR instructions,
> >> support for these instructions are already disabled by TDX
> >> module (SEAM). So CPUID flags for these instructions should
> >> be in disabled state.
> > Why does this not result in a #UD if the instruction is disabled by
> > SEAM? How is it possible to execute a disabled instruction (one
> > precluded by CPUID) to the point where it triggers #VE instead of #UD?
> 
> This is actually a vestige of VMX.  It's quite possible today to have a
> feature which isn't enumerated in CPUID which still exists and "works"
> in the silicon.

No, virtualization holes are something else entirely.  

MONITOR/MWAIT are a bit weird; they do have an enable bit in IA32_MISC_ENABLE,
but most VMMs don't context switch IA32_MISC_ENABLE (load guest value on entry,
load host value on exit) because that would add ~250 cycles to every host<->guest
transition.  And IA32_MISC_ENABLE is shared between SMT siblings, which further
complicates loading the guest's value into hardware.  In the end, it's easier to
leave MONITOR/MWAIT enabled in hardware and instead force a VM-Exit.

As for why TDX injects #VE instead of #UD, I suspect it's for the same reason
that KVM emulates MONITOR/MWAIT as nops instead of injecting a #UD.  The CPUID
bit for MONITOR/MWAIT reflects their enabling in IA32_MISC_ENABLE, not raw
support in hardware.  That means there's no definitive way to enumerate to BIOS
that MONITOR/MWAIT are not supported, e.g. AFAICT, EDKII blindly assumes it can
enable MONITOR/MWAIT in IA32_MISC_ENABLE.  To justify #UD instead of #VE, TDX
would have to inject #GP on WRMSR to set IA32_MISC_ENABLE.ENABLE_MONITOR, and
even then there would be weirdness with respect to VMM behavior in response to
TDVMCALL(WRMSR) since the VMM could allow the virtual write.  In the end, it's
again simpler to inject #VE.
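
The relationship described here can be modeled in a few lines.  The bit
positions are the architectural ones (CPUID.01H:ECX bit 3 for
MONITOR/MWAIT, IA32_MISC_ENABLE bit 18 for the enable), but the helper
names are hypothetical and the functions operate on plain integers
rather than reading real CPUID or MSR state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPUID_1_ECX_MONITOR     (1u << 3)	/* CPUID.01H:ECX[3] */
#define MISC_ENABLE_MONITOR_FSM (1ull << 18)	/* IA32_MISC_ENABLE[18] */

/*
 * Model of the reported ECX value: the MONITOR bit mirrors the MSR
 * enable bit, not raw support in the silicon.  'other_bits' carries
 * whatever else ECX would report.
 */
static uint32_t model_cpuid_1_ecx(uint64_t misc_enable, uint32_t other_bits)
{
	uint32_t ecx = other_bits & ~CPUID_1_ECX_MONITOR;

	if (misc_enable & MISC_ENABLE_MONITOR_FSM)
		ecx |= CPUID_1_ECX_MONITOR;

	return ecx;
}

/* What software checking CPUID would conclude. */
static bool cpu_reports_monitor(uint32_t ecx)
{
	return (ecx & CPUID_1_ECX_MONITOR) != 0;
}
```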

> There are all kinds of pitfalls to doing this, but folks evidently do it in
> public clouds all the time.

Virtualization holes are when instructions/features are enumerated via CPUID,
but don't have a control to hide the feature from the guest (or in the case of
CET, multiple features are buried behind a single control).  So even if the VMM
hides the feature via CPUID, the guest can still _cleanly_ execute the
instruction if it's supported by the underlying hardware.

> The CPUID virtualization basically just traps into the hypervisor and
> lets the hypervisor set whatever register values it wants to appear when
> CPUID "returns".
> 
> But, the controls for what instructions generate #UD are actually quite
> separate and unrelated to CPUID itself.

Eh, any sane VMM will accurately represent its virtual CPU model via CPUID
insofar as possible, there are just too many creaky corners in x86 to make things
100% bombproof.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 16:09       ` Sean Christopherson
@ 2021-05-11 16:16         ` Dave Hansen
  0 siblings, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 16:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dan Williams, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Linux Kernel Mailing List

On 5/11/21 9:09 AM, Sean Christopherson wrote:
>>> Why does this not result in a #UD if the instruction is disabled by
>>> SEAM? How is it possible to execute a disabled instruction (one
>>> precluded by CPUID) to the point where it triggers #VE instead of #UD?
>> This is actually a vestige of VMX.  It's quite possible today to have a
>> feature which isn't enumerated in CPUID which still exists and "works"
>> in the silicon.
> No, virtualization holes are something else entirely.  

I think the bigger point is that *CPUID* doesn't enable or disable
instructions in and of itself.

It can *reflect* enabling (like OSPKE), but nothing is actually enabled
or disabled via CPUID.
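
A small model of the OSPKE case, under the assumption that the helper
name and its integer inputs are purely illustrative: the OSPKE bit in
CPUID.(EAX=7,ECX=0):ECX is an output that follows CR4.PKE, so software
flips the control register and CPUID merely reports the result.

```c
#include <assert.h>
#include <stdint.h>

#define X86_CR4_PKE       (1ull << 22)	/* control: OS enables pkeys */
#define CPUID_7_ECX_PKU   (1u << 3)	/* hardware supports pkeys */
#define CPUID_7_ECX_OSPKE (1u << 4)	/* reflects CR4.PKE */

/* Hypothetical model: CPUID output is derived from state, not vice versa. */
static uint32_t model_cpuid_7_ecx(int hw_has_pku, uint64_t cr4)
{
	uint32_t ecx = 0;

	if (hw_has_pku) {
		ecx |= CPUID_7_ECX_PKU;
		if (cr4 & X86_CR4_PKE)
			ecx |= CPUID_7_ECX_OSPKE;
	}

	return ecx;
}
```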

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 16:04               ` Dave Hansen
@ 2021-05-11 17:06                 ` Andi Kleen
  2021-05-11 17:42                   ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-11 17:06 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List


> How do these end up in practice?  Do they still say "general protection
> fault..."?

Yes, but there's a #VE specific message before it that prints the exit 
reason.


>
> Isn't that really mean for anyone that goes trying to figure out what
> caused these?  If they see a "general protection fault" from WBINVD and
> go digging in the SDM for how a #GP can come from WBINVD, won't they be
> sorely disappointed?

They'll see both the message and also that it isn't a true #VE in the 
backtrace.


-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 17:06                 ` Andi Kleen
@ 2021-05-11 17:42                   ` Dave Hansen
  2021-05-11 17:48                     ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-11 17:42 UTC (permalink / raw)
  To: Andi Kleen, Dan Williams
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On 5/11/21 10:06 AM, Andi Kleen wrote:
>> How do these end up in practice?  Do they still say "general protection
>> fault..."?
> 
> Yes, but there's a #VE specific message before it that prints the exit
> reason.
> 
>> Isn't that really mean for anyone that goes trying to figure out what
>> caused these?  If they see a "general protection fault" from WBINVD and
>> go digging in the SDM for how a #GP can come from WBINVD, won't they be
>> sorely disappointed?
> 
> They'll see both the message and also that it isn't a true #VE in the
> backtrace.

Is there a good reason for the enduring "general protection fault..."
message other than an aversion to refactoring the code?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 16/32] x86/tdx: Handle MWAIT, MONITOR and WBINVD
  2021-05-11 17:42                   ` Dave Hansen
@ 2021-05-11 17:48                     ` Andi Kleen
  2021-05-24 23:32                       ` [RFC v2-fix-v2 1/2] x86/tdx: Handle MWAIT and MONITOR Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-11 17:48 UTC (permalink / raw)
  To: Dave Hansen, Dan Williams
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List


> Is there a good reason for the enduring "general protection fault..."
> message other than an aversion to refactoring the code?

You're the first ever to think it's a problem.

We're assuming that kernel developers are smart enough to understand this.

Please I implore everyone to move on from this patch. This is my last 
email on this topic.

-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 14/32] x86/tdx: Handle port I/O
  2021-05-11 15:35     ` Dave Hansen
  2021-05-11 15:43       ` Dan Williams
@ 2021-05-12  6:17       ` Dan Williams
  2021-05-27  4:23         ` [RFC v2-fix-v1 0/3] " Kuppuswamy Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-12  6:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	Linux Kernel Mailing List

On Tue, May 11, 2021 at 8:36 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/10/21 2:57 PM, Dan Williams wrote:
> >> Decompression code uses port IO for earlyprintk. We must use
> >> paravirt calls there too if we want to allow earlyprintk.
> > What is the tradeoff between teaching the decompression code to handle
> > #VE (the implied assumption) vs teaching it to avoid #VE with direct
> > TDVMCALLs (the chosen direction)?
>
> To me, the tradeoff is not just "teaching" the code to handle a #VE, but
> ensuring that the entire architecture works.
>
> Intentionally invoking a #VE is like making a function call that *MIGHT*
> recurse on itself.  Sure, you can try to come up with a story about
> bounding the recursion.  But, I don't see any semblance of that in this
> series.
>
> Exception-based recursion is really nasty because it's implicit, not
> explicit.  That's why I'm advocating for a design where the kernel never
> intentionally causes a #VE: it never intentionally recurses without bounds.

So this circles back to the common problem with the
mwait/monitor/wbinvd patch and this one. "Can't happen" #VE conditions
should be fatal. I.e. have a nice clear message about why the kernel
failed and halt. All the uses of these #VE triggering instructions can
be eliminated ahead of time with auditing and people that load
unaudited out-of-tree modules that trigger #VE get to keep the pieces.
Said pieces will be described to them by the #VE triggered fail
message. This isn't like split lock disable where the code is
difficult to audit.

What am I missing?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-05-10 15:56         ` Juergen Gross
@ 2021-05-12 12:07           ` Kirill A. Shutemov
  2021-05-12 13:18           ` Peter Zijlstra
  1 sibling, 0 replies; 381+ messages in thread
From: Kirill A. Shutemov @ 2021-05-12 12:07 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Andi Kleen, Borislav Petkov, Kuppuswamy Sathyanarayanan,
	Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

On Mon, May 10, 2021 at 05:56:05PM +0200, Juergen Gross wrote:
> On 10.05.21 17:52, Andi Kleen wrote:
> > \
> > > > > CONFIG_PARAVIRT_XL will be used by TDX that needs couple of paravirt
> > > > > calls that were hidden under CONFIG_PARAVIRT_XXL, but the rest of the
> > > > > config would be a bloat for TDX.
> > > > 
> > > > Used how? Why is it bloat for TDX?
> > > 
> > > Is there any major downside to move the halt related pvops functions
> > > from CONFIG_PARAVIRT_XXL to CONFIG_PARAVIRT?
> > 
> > I think the main motivation is to get rid of all the page table related
> > hooks for modern configurations. These are the bulk of the annotations
> > and  cause bloat and worse code. Shadow page tables are really obscure
> > these days and very few people still need them and it's totally
> > reasonable to build even widely used distribution kernels without them.
> > On contrast most of the other hooks are comparatively few and also on
> > comparatively slow paths, so don't really matter too much.
> > 
> > I think it would be ok to have a CONFIG_PARAVIRT that does not have page
> > table support, and a separate config option for those (that could be
> > eventually deprecated).
> > 
> > But that would break existing .configs for those shadow stack users,
> > that's why I think Kirill did it the other way around.
> 
> No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
> other hypervisor's guests, supporting basically the TLB flush operations
> and time related operations only. Adding the halt related operations to
> PARAVIRT wouldn't break anything.

Yeah, I think we can do this. It should be fine.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-05-08  0:59     ` Kuppuswamy, Sathyanarayanan
@ 2021-05-12 13:00       ` Kirill A. Shutemov
  2021-05-12 14:10         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Kirill A. Shutemov @ 2021-05-12 13:00 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel, Isaku Yamahata

On Fri, May 07, 2021 at 05:59:34PM -0700, Kuppuswamy, Sathyanarayanan wrote:
> 
> 
> On 5/7/21 2:46 PM, Dave Hansen wrote:
> > I know KVM does weird stuff.  But, this is *really* weird.  Why are we
> > #including a .c file into another .c file?
> 
> I think Kirill implemented it this way to skip Makefile changes for it. I don't
> see any other KVM direct dependencies in tdx.c.
> 
> I will fix it in next version.

This has to be compiled only for TDX+KVM.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-10 22:23       ` Dave Hansen
@ 2021-05-12 13:08         ` Kirill A. Shutemov
  2021-05-12 15:44           ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kirill A. Shutemov @ 2021-05-12 13:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel

On Mon, May 10, 2021 at 03:23:29PM -0700, Dave Hansen wrote:
> On 5/10/21 3:19 PM, Kuppuswamy, Sathyanarayanan wrote:
> > On 5/7/21 2:54 PM, Dave Hansen wrote:
> >> This doesn't seem much like common code to me.  It seems like 100% SEV
> >> code.  Is this really where we want to move it?
> > 
> > Both SEV and TDX code has requirement to enable
> > CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED and define force_dma_unencrypted()
> > function.
> > 
> > force_dma_unencrypted() is modified by patch titled "x86/tdx: Make DMA
> > pages shared" to add TDX guest specific support.
> > 
> > Since both SEV and TDX code uses it, its moved to common file.
> 
> That's not an excuse to have a bunch of AMD (or Intel) feature-specific
> code in a file named "common".  I'd make an attempt to keep them
> separate and then call into the two separate functions *from* the common
> function.

But why? What good does the additional level of indirection bring?

It's like saying arch/x86/kernel/cpu/common.c shouldn't have anything AMD
or Intel specific. If a function can cover both vendors I don't see a
point for additional complexity.

-- 
 Kirill A. Shutemov

* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-05-10 15:56         ` Juergen Gross
  2021-05-12 12:07           ` Kirill A. Shutemov
@ 2021-05-12 13:18           ` Peter Zijlstra
  2021-05-12 13:24             ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Peter Zijlstra @ 2021-05-12 13:18 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Andi Kleen, Borislav Petkov, Kuppuswamy Sathyanarayanan,
	Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel

On Mon, May 10, 2021 at 05:56:05PM +0200, Juergen Gross wrote:

> No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
> other hypervisor's guests, supporting basically the TLB flush operations
> and time related operations only. Adding the halt related operations to
> PARAVIRT wouldn't break anything.

Also, I don't think anything modern should actually ever hit any of the
HLT instructions, most everything should end up at an MWAIT.

Still, do we want to give arch_safe_halt() and halt() the
PVOP_ALT_VCALL0() treatment?

* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-05-12 13:18           ` Peter Zijlstra
@ 2021-05-12 13:24             ` Andi Kleen
  2021-05-12 13:51               ` Juergen Gross
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-12 13:24 UTC (permalink / raw)
  To: Peter Zijlstra, Juergen Gross
  Cc: Borislav Petkov, Kuppuswamy Sathyanarayanan, Andy Lutomirski,
	Dave Hansen, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel


On 5/12/2021 6:18 AM, Peter Zijlstra wrote:
> On Mon, May 10, 2021 at 05:56:05PM +0200, Juergen Gross wrote:
>
>> No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
>> other hypervisor's guests, supporting basically the TLB flush operations
>> and time related operations only. Adding the halt related operations to
>> PARAVIRT wouldn't break anything.
> Also, I don't think anything modern should actually ever hit any of the
> HLT instructions, most everything should end up at an MWAIT.
>
> Still, do we want to give arch_safe_halt() and halt() the
> PVOP_ALT_VCALL0() treatment?

For performance reasons it's pointless to patch. HLT (and MWAIT) are
so slow anyway that the choice between patching and an indirect pointer
is completely in the noise. So I would use whatever is cleanest in the code.

-Andi

* Re: [RFC v2 01/32] x86/paravirt: Introduce CONFIG_PARAVIRT_XL
  2021-05-12 13:24             ` Andi Kleen
@ 2021-05-12 13:51               ` Juergen Gross
  2021-05-17 23:50                 ` [RFC v2-fix 1/1] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Juergen Gross @ 2021-05-12 13:51 UTC (permalink / raw)
  To: Andi Kleen, Peter Zijlstra
  Cc: Borislav Petkov, Kuppuswamy Sathyanarayanan, Andy Lutomirski,
	Dave Hansen, Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel


On 12.05.21 15:24, Andi Kleen wrote:
> 
> On 5/12/2021 6:18 AM, Peter Zijlstra wrote:
>> On Mon, May 10, 2021 at 05:56:05PM +0200, Juergen Gross wrote:
>>
>>> No. We have PARAVIRT_XXL for Xen PV guests, and we have PARAVIRT for
>>> other hypervisor's guests, supporting basically the TLB flush operations
>>> and time related operations only. Adding the halt related operations to
>>> PARAVIRT wouldn't break anything.
>> Also, I don't think anything modern should actually ever hit any of the
>> HLT instructions, most everything should end up at an MWAIT.
>>
>> Still, do we want to give arch_safe_halt() and halt() the
>> PVOP_ALT_VCALL0() treatment?
> 
> For performance reasons it's pointless to patch. HLT (and MWAIT) are
> so slow anyway that the choice between patching and an indirect pointer
> is completely in the noise. So I would use whatever is cleanest in the code.

This would probably be x86_platform_ops.hyper hooks.
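
As a rough sketch of what a hook-based approach could look like, here is a small userspace model in C. It is not kernel code: `struct hyper_ops_sketch`, `native_halt_sketch()`, `tdx_halt_sketch()` and the counters are hypothetical names made up for illustration, and the real layout of x86_platform_ops.hyper is not reproduced here.

```c
/* Hypothetical stand-in for a set of hypervisor-specific hooks. */
struct hyper_ops_sketch {
	void (*halt)(void);       /* halt, interrupt state left as-is */
	void (*safe_halt)(void);  /* enable interrupts, then halt */
};

static int native_halts;  /* counts calls that took the native path */
static int tdx_halts;     /* counts calls that took the TDX path */

static void native_halt_sketch(void) { native_halts++; }
static void tdx_halt_sketch(void)    { tdx_halts++; }

/* Default to the native implementations. */
static struct hyper_ops_sketch hyper_ops = {
	.halt      = native_halt_sketch,
	.safe_halt = native_halt_sketch,
};

/* Guest-specific setup overrides the hooks once, at init time. */
static void tdx_init_sketch(void)
{
	hyper_ops.halt      = tdx_halt_sketch;
	hyper_ops.safe_halt = tdx_halt_sketch;
}

/* Callers pay one indirect call; no code patching is involved. */
static void halt_sketch(void)      { hyper_ops.halt(); }
static void safe_halt_sketch(void) { hyper_ops.safe_halt(); }
```

The trade-off is the one discussed above: an indirect call instead of a patched call site, which should not matter on a path as slow as HLT.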


Juergen


* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-05-12 13:00       ` Kirill A. Shutemov
@ 2021-05-12 14:10         ` Kuppuswamy, Sathyanarayanan
  2021-05-12 14:29           ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-12 14:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Peter Zijlstra, Andy Lutomirski, Dan Williams,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel, Isaku Yamahata



On 5/12/21 6:00 AM, Kirill A. Shutemov wrote:
> This has to be compiled only for TDX+KVM.

Got it. So if we want to remove the "C" file include, we will have to
add an ifdef CONFIG_KVM_GUEST in the Makefile.

ifdef CONFIG_KVM_GUEST
obj-$(CONFIG_INTEL_TDX_GUEST) += tdx-kvm.o
endif

Dave, do you prefer the above change over the "C" file include?

#ifdef CONFIG_KVM_GUEST
#include "tdx-kvm.c"
#endif

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-05-12 14:10         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-12 14:29           ` Dave Hansen
  2021-05-13 19:29             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-12 14:29 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Kirill A. Shutemov
  Cc: Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata

On 5/12/21 7:10 AM, Kuppuswamy, Sathyanarayanan wrote:
> On 5/12/21 6:00 AM, Kirill A. Shutemov wrote:
>> This has to be compiled only for TDX+KVM.
> 
> Got it. So if we want to remove the "C" file include, we will have to
> add an ifdef CONFIG_KVM_GUEST in the Makefile.
> 
> ifdef CONFIG_KVM_GUEST
> obj-$(CONFIG_INTEL_TDX_GUEST) += tdx-kvm.o
> endif

Is there truly no dependency between CONFIG_KVM_GUEST and
CONFIG_INTEL_TDX_GUEST?

If there isn't, then the way we do it is adding another (invisible)
Kconfig variable to express the dependency for tdx-kvm.o:

config INTEL_TDX_GUEST_KVM
	bool
	depends on KVM_GUEST && INTEL_TDX_GUEST

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-12 13:08         ` Kirill A. Shutemov
@ 2021-05-12 15:44           ` Dave Hansen
  2021-05-12 15:53             ` Sean Christopherson
  2021-05-18  1:28             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-12 15:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, Sean Christopherson,
	linux-kernel

On 5/12/21 6:08 AM, Kirill A. Shutemov wrote:
>> That's not an excuse to have a bunch of AMD (or Intel) feature-specific
>> code in a file named "common".  I'd make an attempt to keep them
>> separate and then call into the two separate functions *from* the common
>> function.
> But why? What good does the additional level of indirection bring?
> 
> It's like saying arch/x86/kernel/cpu/common.c shouldn't have anything AMD
> or Intel specific. If a function can cover both vendors I don't see a
> point for additional complexity.

Because the code is already separate.  You're actually going to some
trouble to move the SEV-specific code and then combine it with the
TDX-specific code.

Anyway, please just give it a shot.  Should take all of ten minutes.  If
it doesn't work out in practice, fine.  You'll have a good paragraph for
the changelog.
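
A minimal sketch of the split being asked for, written as a userspace C model. The names `amd_force_dma_unencrypted_sketch()`, `tdx_force_dma_unencrypted_sketch()`, `struct device_sketch` and the two boolean flags are assumptions for illustration only; the real predicates and the real SEV/SME conditions are not reproduced here.

```c
#include <stdbool.h>

struct device_sketch { int unused; };  /* stand-in for struct device */

/* Stand-ins for runtime detection; set by the caller in this model. */
static bool sev_active_sketch;
static bool tdx_guest_sketch;

/* Would live in the AMD-specific file. */
static bool amd_force_dma_unencrypted_sketch(struct device_sketch *dev)
{
	(void)dev;
	return sev_active_sketch;
}

/* Would live in the TDX-specific file. */
static bool tdx_force_dma_unencrypted_sketch(struct device_sketch *dev)
{
	(void)dev;
	return tdx_guest_sketch;
}

/* The common entry point only dispatches; it holds no vendor logic. */
static bool force_dma_unencrypted_sketch(struct device_sketch *dev)
{
	return amd_force_dma_unencrypted_sketch(dev) ||
	       tdx_force_dma_unencrypted_sketch(dev);
}
```

With this shape the vendor-specific conditions stay in their own files and the common file carries only the dispatch.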

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-12 15:44           ` Dave Hansen
@ 2021-05-12 15:53             ` Sean Christopherson
  2021-05-13 16:40               ` Kuppuswamy, Sathyanarayanan
  2021-05-18  1:28             ` Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-12 15:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Kuppuswamy, Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel

On Wed, May 12, 2021, Dave Hansen wrote:
> On 5/12/21 6:08 AM, Kirill A. Shutemov wrote:
> >> That's not an excuse to have a bunch of AMD (or Intel) feature-specific
> >> code in a file named "common".  I'd make an attempt to keep them
> >> separate and then call into the two separate functions *from* the common
> >> function.
> > But why? What good does the additional level of indirection bring?
> > 
> > It's like saying arch/x86/kernel/cpu/common.c shouldn't have anything AMD
> > or Intel specific. If a function can cover both vendors I don't see a
> > point for additional complexity.
> 
> Because the code is already separate.  You're actually going to some
> trouble to move the SEV-specific code and then combine it with the
> TDX-specific code.
> 
> Anyway, please just give it a shot.  Should take all of ten minutes.  If
> it doesn't work out in practice, fine.  You'll have a good paragraph for
> the changelog.

Or maybe wait to see how Boris' proposed protected_guest_has() pans out?  E.g. if
we can do "protected_guest_has(MEMORY_ENCRYPTION)" or whatever, then the truly
common bits could be placed into common.c without any vendor-specific logic.

* Re: [RFC v2 21/32] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-04-26 18:01 ` [RFC v2 21/32] x86/boot: Add a trampoline for APs booting in 64-bit mode Kuppuswamy Sathyanarayanan
@ 2021-05-13  2:56   ` Dan Williams
  2021-05-18  0:54     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-13  2:56 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Sean Christopherson, Kai Huang

On Mon, Apr 26, 2021 at 11:03 AM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Add a trampoline for booting APs in 64-bit mode via a software handoff
> with BIOS, and use the new trampoline for the ACPI MP wake protocol used
> by TDX.

Let's add a spec reference:

See section "4.1 ACPI-MADT-AP-Wakeup Table" in the Guest-Host
Communication Interface specification for TDX.

Although, there is not much "wake protocol" in this patch, this
appears to be the end of the process after the CPU has been messaged
to start.

> Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
> mode.  For the GDT pointer, create a new entry as the existing storage
> for the pointer occupies the zero entry in the GDT itself.
>
> Reported-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/include/asm/realmode.h          |  1 +
>  arch/x86/kernel/smpboot.c                |  5 +++
>  arch/x86/realmode/rm/header.S            |  1 +
>  arch/x86/realmode/rm/trampoline_64.S     | 49 +++++++++++++++++++++++-
>  arch/x86/realmode/rm/trampoline_common.S |  5 ++-
>  5 files changed, 58 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
> index 5db5d083c873..5066c8b35e7c 100644
> --- a/arch/x86/include/asm/realmode.h
> +++ b/arch/x86/include/asm/realmode.h
> @@ -25,6 +25,7 @@ struct real_mode_header {
>         u32     sev_es_trampoline_start;
>  #endif
>  #ifdef CONFIG_X86_64
> +       u32     trampoline_start64;
>         u32     trampoline_pgd;
>  #endif
>         /* ACPI S3 wakeup */
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 16703c35a944..27d8491d753a 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1036,6 +1036,11 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
>         unsigned long boot_error = 0;
>         unsigned long timeout;
>
> +#ifdef CONFIG_X86_64
> +       if (is_tdx_guest())
> +               start_ip = real_mode_header->trampoline_start64;
> +#endif

Perhaps wrap this into an inline helper in
arch/x86/include/asm/realmode.h so that this routine only does one
assignment to @start_ip at function entry?
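
A minimal sketch of such a helper, as a userspace C model. `real_mode_start_ip_sketch()` and the reduced `struct real_mode_header_sketch` are hypothetical; in the kernel the TDX check would presumably come from is_tdx_guest() under CONFIG_X86_64 rather than from a parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reduced model of struct real_mode_header: only the two fields used here. */
struct real_mode_header_sketch {
	uint32_t trampoline_start;
	uint32_t trampoline_start64;
};

/* Hypothetical helper: choose the AP entry point in one place. */
static inline unsigned long
real_mode_start_ip_sketch(const struct real_mode_header_sketch *hdr,
			  bool tdx_guest)
{
	return tdx_guest ? hdr->trampoline_start64 : hdr->trampoline_start;
}
```

do_boot_cpu() could then initialise @start_ip once from the helper instead of overriding it inside an #ifdef block.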

> +
>         idle->thread.sp = (unsigned long)task_pt_regs(idle);
>         early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
>         initial_code = (unsigned long)start_secondary;
> diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
> index 8c1db5bf5d78..2eb62be6d256 100644
> --- a/arch/x86/realmode/rm/header.S
> +++ b/arch/x86/realmode/rm/header.S
> @@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
>         .long   pa_sev_es_trampoline_start
>  #endif
>  #ifdef CONFIG_X86_64
> +       .long   pa_trampoline_start64
>         .long   pa_trampoline_pgd;
>  #endif
>         /* ACPI S3 wakeup */
> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
> index 84c5d1b33d10..12b734b1da8b 100644
> --- a/arch/x86/realmode/rm/trampoline_64.S
> +++ b/arch/x86/realmode/rm/trampoline_64.S
> @@ -143,13 +143,20 @@ SYM_CODE_START(startup_32)
>         movl    %eax, %cr3
>
>         # Set up EFER
> +       movl    $MSR_EFER, %ecx
> +       rdmsr
> +       cmp     pa_tr_efer, %eax
> +       jne     .Lwrite_efer
> +       cmp     pa_tr_efer + 4, %edx
> +       je      .Ldone_efer
> +.Lwrite_efer:
>         movl    pa_tr_efer, %eax
>         movl    pa_tr_efer + 4, %edx
> -       movl    $MSR_EFER, %ecx
>         wrmsr

Is this hunk just a performance optimization to save an unnecessary
wrmsr when it is pre-populated with the right value? Is it required
for this patch? If "yes", it was not clear to me from the changelog,
if "no" seems like it belongs in a standalone optimization patch.

>
> +.Ldone_efer:
>         # Enable paging and in turn activate Long Mode
> -       movl    $(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
> +       movl    $(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax

It seems setting X86_CR0_NE is redundant when coming through
pa_trampoline_compat, is this a standalone fix to make sure that
'numeric-error' is enabled before startup_64?

>         movl    %eax, %cr0
>
>         /*
> @@ -161,6 +168,19 @@ SYM_CODE_START(startup_32)
>         ljmpl   $__KERNEL_CS, $pa_startup_64
>  SYM_CODE_END(startup_32)
>
> +SYM_CODE_START(pa_trampoline_compat)
> +       /*
> +        * In compatibility mode.  Prep ESP and DX for startup_32, then disable
> +        * paging and complete the switch to legacy 32-bit mode.
> +        */
> +       movl    $rm_stack_end, %esp
> +       movw    $__KERNEL_DS, %dx
> +
> +       movl    $(X86_CR0_NE | X86_CR0_PE), %eax
> +       movl    %eax, %cr0
> +       ljmpl   $__KERNEL32_CS, $pa_startup_32
> +SYM_CODE_END(pa_trampoline_compat)
> +
>         .section ".text64","ax"
>         .code64
>         .balign 4
> @@ -169,6 +189,20 @@ SYM_CODE_START(startup_64)
>         jmpq    *tr_start(%rip)
>  SYM_CODE_END(startup_64)
>
> +SYM_CODE_START(trampoline_start64)
> +       /*
> +        * APs start here on a direct transfer from 64-bit BIOS with identity
> +        * mapped page tables.  Load the kernel's GDT in order to gear down to
> +        * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
> +        * segment registers.  Load the zero IDT so any fault triggers a
> +        * shutdown instead of jumping back into BIOS.
> +        */
> +       lidt    tr_idt(%rip)
> +       lgdt    tr_gdt64(%rip)
> +
> +       ljmpl   *tr_compat(%rip)
> +SYM_CODE_END(trampoline_start64)
> +
>         .section ".rodata","a"
>         # Duplicate the global descriptor table
>         # so the kernel can live anywhere
> @@ -182,6 +216,17 @@ SYM_DATA_START(tr_gdt)
>         .quad   0x00cf93000000ffff      # __KERNEL_DS
>  SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
>
> +SYM_DATA_START(tr_gdt64)
> +       .short  tr_gdt_end - tr_gdt - 1 # gdt limit
> +       .long   pa_tr_gdt
> +       .long   0
> +SYM_DATA_END(tr_gdt64)
> +
> +SYM_DATA_START(tr_compat)
> +       .long   pa_trampoline_compat
> +       .short  __KERNEL32_CS
> +SYM_DATA_END(tr_compat)
> +
>         .bss
>         .balign PAGE_SIZE
>  SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
> diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
> index 5033e640f957..506d5897112a 100644
> --- a/arch/x86/realmode/rm/trampoline_common.S
> +++ b/arch/x86/realmode/rm/trampoline_common.S
> @@ -1,4 +1,7 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>         .section ".rodata","a"
>         .balign 16
> -SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
> +SYM_DATA_START_LOCAL(tr_idt)
> +       .short  0
> +       .quad   0
> +SYM_DATA_END(tr_idt)

Curious, is the following not equivalent?

-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+SYM_DATA_LOCAL(tr_idt, .fill 1, 10, 0)
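
For reference on the sizes involved, the sketch below models the LIDT/LGDT memory operand in C. The struct names are made up for illustration. Outside 64-bit mode the operand is a 16-bit limit plus a 32-bit base (6 bytes); in 64-bit mode the base is 64 bits wide, giving the 10 bytes that `.short 0` followed by `.quad 0` reserves.

```c
#include <stdint.h>

/* Operand of LIDT/LGDT outside 64-bit mode: 16-bit limit + 32-bit base. */
struct desc_ptr32_sketch {
	uint16_t limit;
	uint32_t base;
} __attribute__((packed));

/* Operand of LIDT/LGDT in 64-bit mode: 16-bit limit + 64-bit base. */
struct desc_ptr64_sketch {
	uint16_t limit;
	uint64_t base;
} __attribute__((packed));

_Static_assert(sizeof(struct desc_ptr32_sketch) == 6,
	       "32-bit descriptor pointer is 6 bytes");
_Static_assert(sizeof(struct desc_ptr64_sketch) == 10,
	       "64-bit descriptor pointer is 10 bytes");
```

One caveat, stated from memory of the GNU as manual rather than from this thread: .fill treats a size argument larger than 8 as 8, so `.fill 1, 10, 0` may emit fewer than 10 bytes. If a one-line form is wanted, `.fill 10, 1, 0` (ten one-byte zeros) would be the safer spelling.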

* Re: [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot for TDX platforms
  2021-04-26 18:01 ` [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot for TDX platforms Kuppuswamy Sathyanarayanan
@ 2021-05-13  3:03   ` Dan Williams
  0 siblings, 0 replies; 381+ messages in thread
From: Dan Williams @ 2021-05-13  3:03 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Sean Christopherson

On Mon, Apr 26, 2021 at 11:03 AM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Avoid operations which will inject #VE during compressed
> boot, which is obviously fatal for TDX platforms.
>
> Details are,
>
>  1. TDX module injects #VE if a TDX guest attempts to write
>     EFER. So skip the WRMSR to set EFER.LME=1 if it's already
>     set. TDX also forces EFER.LME=1, i.e. the branch will always
>     be taken and thus the #VE avoided.

Ah here's the justification for that hunk in the previous patch, are
you sure that hunk belongs in the trampoline patch?

>
>  2. TDX module also injects a #VE if the guest attempts to clear
>     CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed
>     boot. Setting CR0.NE should be a nop on all CPUs that
>     support 64-bit mode.

Ah, here's the justification for CR0.NE in the previous patch. Did
something go wrong in the patch splitting?
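
The "skip the WRMSR if EFER.LME is already set" logic from item 1 can be modelled in C as below. This is a userspace sketch under stated assumptions: `efer_model`, `wrmsr_count` and the function names are hypothetical, and the MSR is just a variable. The only hardware fact relied on is that EFER.LME is bit 8.

```c
#include <stdint.h>

#define EFER_LME_SKETCH (1u << 8)   /* EFER.LME is bit 8 */

static uint64_t efer_model;  /* modelled MSR contents */
static int wrmsr_count;      /* number of modelled WRMSRs */

/* Models WRMSR to EFER and records that a write happened. */
static void wrmsr_efer_sketch(uint64_t val)
{
	efer_model = val;
	wrmsr_count++;
}

/* Set EFER.LME, but only write the MSR if the bit was clear. */
static void enable_long_mode_sketch(void)
{
	uint64_t efer = efer_model;  /* models RDMSR */

	if (efer & EFER_LME_SKETCH)
		return;  /* already set: no write is issued */
	wrmsr_efer_sketch(efer | EFER_LME_SKETCH);
}
```

This mirrors the btsl/jc pair in the hunk below: btsl sets the bit and leaves its previous value in the carry flag, and jc skips the wrmsr when the bit was already set.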

>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/boot/compressed/head_64.S | 5 +++--
>  arch/x86/boot/compressed/pgtable.h | 2 +-
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index e94874f4bbc1..37c2f37d4a0d 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -616,8 +616,9 @@ SYM_CODE_START(trampoline_32bit_src)
>         movl    $MSR_EFER, %ecx
>         rdmsr
>         btsl    $_EFER_LME, %eax
> +       jc      1f
>         wrmsr
> -       popl    %edx
> +1:     popl    %edx
>         popl    %ecx
>
>         /* Enable PAE and LA57 (if required) paging modes */
> @@ -636,7 +637,7 @@ SYM_CODE_START(trampoline_32bit_src)
>         pushl   %eax
>
>         /* Enable paging again */
> -       movl    $(X86_CR0_PG | X86_CR0_PE), %eax
> +       movl    $(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
>         movl    %eax, %cr0
>
>         lret
> diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
> index 6ff7e81b5628..cc9b2529a086 100644
> --- a/arch/x86/boot/compressed/pgtable.h
> +++ b/arch/x86/boot/compressed/pgtable.h
> @@ -6,7 +6,7 @@
>  #define TRAMPOLINE_32BIT_PGTABLE_OFFSET        0
>
>  #define TRAMPOLINE_32BIT_CODE_OFFSET   PAGE_SIZE
> -#define TRAMPOLINE_32BIT_CODE_SIZE     0x70
> +#define TRAMPOLINE_32BIT_CODE_SIZE     0x80
>
>  #define TRAMPOLINE_32BIT_STACK_END     TRAMPOLINE_32BIT_SIZE
>
> --
> 2.25.1
>

* Re: [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process
  2021-04-26 18:01 ` [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process Kuppuswamy Sathyanarayanan
@ 2021-05-13  3:23   ` Dan Williams
  2021-05-18  0:59     ` [RFC v2-fix 1/1] x86/boot: Avoid #VE during boot for TDX platforms Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-13  3:23 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List

On Mon, Apr 26, 2021 at 11:03 AM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Skip writing EFER during secondary_startup_64() if the current value is
> also the desired value. This avoids a #VE when running as a TDX guest,
> as the TDX-Module does not allow writes to EFER (even when writing the
> current, fixed value).
>
> Also, preserve CR4.MCE instead of clearing it during boot to avoid a #VE
> when running as a TDX guest. The TDX-Module (effectively part of the
> hypervisor) requires CR4.MCE to be set at all times and injects a #VE
> if the guest attempts to clear CR4.MCE.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/boot/compressed/head_64.S |  5 ++++-
>  arch/x86/kernel/head_64.S          | 13 +++++++++++--
>  2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index 37c2f37d4a0d..2d79e5f97360 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -622,7 +622,10 @@ SYM_CODE_START(trampoline_32bit_src)
>         popl    %ecx
>
>         /* Enable PAE and LA57 (if required) paging modes */
> -       movl    $X86_CR4_PAE, %eax
> +       movl    %cr4, %eax
> +       /* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
> +       andl    $X86_CR4_MCE, %eax
> +       orl     $X86_CR4_PAE, %eax
>         testl   %edx, %edx
>         jz      1f
>         orl     $X86_CR4_LA57, %eax
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 04bddaaba8e2..92c77cf75542 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
>  1:
>
>         /* Enable PAE mode, PGE and LA57 */
> -       movl    $(X86_CR4_PAE | X86_CR4_PGE), %ecx
> +       movq    %cr4, %rcx
> +       /* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
> +       andl    $X86_CR4_MCE, %ecx
> +       orl     $(X86_CR4_PAE | X86_CR4_PGE), %ecx
>  #ifdef CONFIG_X86_5LEVEL
>         testl   $1, __pgtable_l5_enabled(%rip)
>         jz      1f
> @@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
>         /* Setup EFER (Extended Feature Enable Register) */
>         movl    $MSR_EFER, %ecx
>         rdmsr
> +       movl    %eax, %edx

Maybe comment that EFER is being saved here to check if the following
enables are nops, but not a big deal.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

...modulo whether the EFER wrmsr avoidance in PATCH 21 should move here.
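
For the CR4 side of this patch, the mask-then-or sequence in the hunks above amounts to the following, sketched in C. The macro and function names are hypothetical; the bit positions (PAE = 5, MCE = 6, PGE = 7, LA57 = 12) are the architectural ones.

```c
#include <stdint.h>

#define CR4_PAE_SKETCH  (1u << 5)
#define CR4_MCE_SKETCH  (1u << 6)
#define CR4_PGE_SKETCH  (1u << 7)
#define CR4_LA57_SKETCH (1u << 12)

/*
 * Build the boot-time CR4 value, carrying over only CR4.MCE from the
 * current register contents so that the bit is never cleared.
 */
static uint32_t boot_cr4_sketch(uint32_t cur_cr4, int la57)
{
	uint32_t cr4 = cur_cr4 & CR4_MCE_SKETCH;  /* keep MCE as found */

	cr4 |= CR4_PAE_SKETCH | CR4_PGE_SKETCH;
	if (la57)
		cr4 |= CR4_LA57_SKETCH;
	return cr4;
}
```

Every other bit of the old value is dropped, as before the patch; only MCE survives the mask.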

>         btsl    $_EFER_SCE, %eax        /* Enable System Call */
>         btl     $20,%edi                /* No Execute supported? */
>         jnc     1f
>         btsl    $_EFER_NX, %eax
>         btsq    $_PAGE_BIT_NX,early_pmd_flags(%rip)
> -1:     wrmsr                           /* Make changes effective */
>
> +       /* Skip the WRMSR if the current value matches the desired value. */
> +1:     cmpl    %edx, %eax
> +       je      1f
> +       xor     %edx, %edx
> +       wrmsr                           /* Make changes effective */
> +1:
>         /* Setup cr0 */
>         movl    $CR0_STATE, %eax
>         /* Make changes effective */
> --
> 2.25.1
>

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-12 15:53             ` Sean Christopherson
@ 2021-05-13 16:40               ` Kuppuswamy, Sathyanarayanan
  2021-05-13 17:49                 ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-13 16:40 UTC (permalink / raw)
  To: Sean Christopherson, Dave Hansen
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel



On 5/12/21 8:53 AM, Sean Christopherson wrote:
> On Wed, May 12, 2021, Dave Hansen wrote:
>> On 5/12/21 6:08 AM, Kirill A. Shutemov wrote:
>>>> That's not an excuse to have a bunch of AMD (or Intel) feature-specific
>>>> code in a file named "common".  I'd make an attempt to keep them
>>>> separate and then call into the two separate functions *from* the common
>>>> function.
>>> But why? What good does the additional level of indirection bring?
>>>
>>> It's like saying arch/x86/kernel/cpu/common.c shouldn't have anything AMD
>>> or Intel specific. If a function can cover both vendors I don't see a
>>> point for additional complexity.
>>
>> Because the code is already separate.  You're actually going to some
>> trouble to move the SEV-specific code and then combine it with the
>> TDX-specific code.
>>
>> Anyway, please just give it a shot.  Should take all of ten minutes.  If
>> it doesn't work out in practice, fine.  You'll have a good paragraph for
>> the changelog.
> 
> Or maybe wait to see how Boris' proposed protected_guest_has() pans out?  E.g. if
> we can do "protected_guest_has(MEMORY_ENCRYPTION)" or whatever, then the truly
> common bits could be placed into common.c without any vendor-specific logic.

How about the following abstraction? This patch was initially created to let us use
is_tdx_guest() outside of arch/x86 code, but I extended it to support bitmap flags.

commit 188bdd3c97e49020b2bda9efd992a22091423b85
Author: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Date:   Wed May 12 11:35:13 2021 -0700

     tdx: Introduce generic protected_guest abstraction

     Add a generic way to check if we run with an encrypted guest,
     without requiring x86 specific ifdefs. This can then be used in
     non architecture specific code. Enable this when running under
     TDX/SEV.

     Also add helper functions to set/test encrypted guest feature
     flags.

     Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

diff --git a/arch/Kconfig b/arch/Kconfig
index ecfd3520b676..98c30312555b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -956,6 +956,9 @@ config HAVE_ARCH_NVRAM_OPS
  config ISA_BUS_API
  	def_bool ISA

+config ARCH_HAS_PROTECTED_GUEST
+	bool
+
  #
  # ABI hall of shame
  #
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 07fb4df1d881..001487c21874 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -882,6 +882,7 @@ config INTEL_TDX_GUEST
  	select PARAVIRT_XL
  	select X86_X2APIC
  	select SECURITY_LOCKDOWN_LSM
+	select ARCH_HAS_PROTECTED_GUEST
  	help
  	  Provide support for running in a trusted domain on Intel processors
  	  equipped with Trusted Domain eXtenstions. TDX is a new Intel
@@ -1537,6 +1538,7 @@ config AMD_MEM_ENCRYPT
  	select ARCH_USE_MEMREMAP_PROT
  	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
  	select INSTRUCTION_DECODER
+	select ARCH_HAS_PROTECTED_GUEST
  	help
  	  Say yes to enable support for the encryption of system memory.
  	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ccab6cf91283..8260893c34ae 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -21,6 +21,7 @@
  #include <linux/usb/xhci-dbgp.h>
  #include <linux/static_call.h>
  #include <linux/swiotlb.h>
+#include <linux/protected_guest.h>

  #include <uapi/linux/mount.h>

@@ -107,6 +108,10 @@ static struct resource bss_resource = {
  	.flags	= IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
  };

+#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
+DECLARE_BITMAP(protected_guest_flags, PROTECTED_GUEST_BITMAP_LEN);
+EXPORT_SYMBOL(protected_guest_flags);
+#endif

  #ifdef CONFIG_X86_32
  /* CPU data as detected by the assembly code in head_32.S */
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 04a780abb512..45b848ec8325 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -19,6 +19,7 @@
  #include <linux/memblock.h>
  #include <linux/kernel.h>
  #include <linux/mm.h>
+#include <linux/protected_guest.h>

  #include <asm/cpu_entry_area.h>
  #include <asm/stacktrace.h>
@@ -680,6 +681,9 @@ static void __init init_ghcb(int cpu)

  	data->ghcb_active = false;
  	data->backup_ghcb_active = false;
+
+	set_protected_guest_flag(GUEST_TYPE_SEV);
+	set_protected_guest_flag(MEMORY_ENCRYPTION);
  }

  void __init sev_es_init_vc_handling(void)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 4dfacde05f0c..d0207b990fe4 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,7 @@
  #include <asm/vmx.h>

  #include <linux/cpu.h>
+#include <linux/protected_guest.h>

  static struct {
  	unsigned int gpa_width;
@@ -92,6 +93,9 @@ void __init tdx_early_init(void)

  	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);

+	set_protected_guest_flag(GUEST_TYPE_TDX);
+	set_protected_guest_flag(MEMORY_ENCRYPTION);
+
  	tdg_get_info();

  	pr_info("TDX guest is initialized\n");
diff --git a/include/linux/protected_guest.h b/include/linux/protected_guest.h
new file mode 100644
index 000000000000..44e8c642654c
--- /dev/null
+++ b/include/linux/protected_guest.h
@@ -0,0 +1,37 @@
+#ifndef _LINUX_PROTECTED_GUEST_H
+#define _LINUX_PROTECTED_GUEST_H 1
+
+#define PROTECTED_GUEST_BITMAP_LEN	128
+
+/* Protected Guest vendor types */
+#define GUEST_TYPE_TDX			(1)
+#define GUEST_TYPE_SEV			(2)
+
+/* Protected Guest features */
+#define MEMORY_ENCRYPTION		(20)
+
+#ifdef CONFIG_ARCH_HAS_PROTECTED_GUEST
+extern DECLARE_BITMAP(protected_guest_flags, PROTECTED_GUEST_BITMAP_LEN);
+
+static inline bool protected_guest_has(unsigned long flag)
+{
+	return test_bit(flag, protected_guest_flags);
+}
+
+static inline void set_protected_guest_flag(unsigned long flag)
+{
+	__set_bit(flag, protected_guest_flags);
+}
+
+static inline bool is_protected_guest(void)
+{
+	return protected_guest_has(GUEST_TYPE_TDX) ||
+	       protected_guest_has(GUEST_TYPE_SEV);
+}
+#else
+static inline bool protected_guest_has(unsigned long flag) { return false; }
+static inline void set_protected_guest_flag(unsigned long flag) { }
+static inline bool is_protected_guest(void) { return false; }
+#endif
+
+#endif


> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-13 16:40               ` Kuppuswamy, Sathyanarayanan
@ 2021-05-13 17:49                 ` Dave Hansen
  2021-05-13 18:17                   ` Kuppuswamy, Sathyanarayanan
  2021-05-13 19:38                   ` Andi Kleen
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-13 17:49 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Sean Christopherson
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel

On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
> 
> +#define PROTECTED_GUEST_BITMAP_LEN    128
> +
> +/* Protected Guest vendor types */
> +#define GUEST_TYPE_TDX            (1)
> +#define GUEST_TYPE_SEV            (2)
> +
> +/* Protected Guest features */
> +#define MEMORY_ENCRYPTION        (20)

I was assuming we'd reuse the X86_FEATURE infrastructure somehow.  Is
there a good reason not to?

That gives us all the compile-time optimization (via
en/disabled-features.h) and static branches for "free".

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-13 17:49                 ` Dave Hansen
@ 2021-05-13 18:17                   ` Kuppuswamy, Sathyanarayanan
  2021-05-13 19:38                   ` Andi Kleen
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-13 18:17 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel



On 5/13/21 10:49 AM, Dave Hansen wrote:
> On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
>>
>> +#define PROTECTED_GUEST_BITMAP_LEN    128
>> +
>> +/* Protected Guest vendor types */
>> +#define GUEST_TYPE_TDX            (1)
>> +#define GUEST_TYPE_SEV            (2)
>> +
>> +/* Protected Guest features */
>> +#define MEMORY_ENCRYPTION        (20)
> 
> I was assuming we'd reuse the X86_FEATURE infrastructure somehow.  Is
> there a good reason not to?

My assumption is that the protected guest abstraction can also be used by
non-x86 architectures in the future, so I have tried to keep these
definitions in common code.
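
For illustration only, here is a minimal user-space sketch of how the
proposed flag helpers behave. The flag values and helper names come from
the header quoted above; the hand-rolled bitmap (BITS_PER_WORD,
BITMAP_WORDS) is an assumption standing in for the kernel's
DECLARE_BITMAP(), test_bit() and __set_bit(), so this is not the kernel
implementation itself.

```c
#include <limits.h>
#include <stdbool.h>

/* Values mirror the proposed header; the bitmap below is a stand-in. */
#define PROTECTED_GUEST_BITMAP_LEN	128
#define GUEST_TYPE_TDX			1
#define GUEST_TYPE_SEV			2
#define MEMORY_ENCRYPTION		20

/* Assumed helpers replacing the kernel bitmap macros in user space. */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS	((PROTECTED_GUEST_BITMAP_LEN + BITS_PER_WORD - 1) / BITS_PER_WORD)

static unsigned long protected_guest_flags[BITMAP_WORDS];

/* Set one flag bit; callers pass a value below PROTECTED_GUEST_BITMAP_LEN. */
static inline void set_protected_guest_flag(unsigned long flag)
{
	protected_guest_flags[flag / BITS_PER_WORD] |= 1UL << (flag % BITS_PER_WORD);
}

/* Test one flag bit. */
static inline bool protected_guest_has(unsigned long flag)
{
	return (protected_guest_flags[flag / BITS_PER_WORD] >> (flag % BITS_PER_WORD)) & 1UL;
}

/* True if any known protected guest type has been registered. */
static inline bool is_protected_guest(void)
{
	return protected_guest_has(GUEST_TYPE_TDX) ||
	       protected_guest_has(GUEST_TYPE_SEV);
}
```

A TDX-style init path would call set_protected_guest_flag(GUEST_TYPE_TDX)
and set_protected_guest_flag(MEMORY_ENCRYPTION) once during early boot;
generic code would then only query the flags.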


> 
> That gives us all the compile-time optimization (via
> en/disabled-features.h) and static branches for "free".
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-05-12 14:29           ` Dave Hansen
@ 2021-05-13 19:29             ` Kuppuswamy, Sathyanarayanan
  2021-05-13 19:33               ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-13 19:29 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov
  Cc: Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata



On 5/12/21 7:29 AM, Dave Hansen wrote:
> On 5/12/21 7:10 AM, Kuppuswamy, Sathyanarayanan wrote:
>> On 5/12/21 6:00 AM, Kirill A. Shutemov wrote:
>>> This has to be compiled only for TDX+KVM.
>>
>> Got it. So if we want to remove the "C" file include, we will have to
>> add #ifdef CONFIG_KVM_GUEST in Makefile.
>>
>> ifdef CONFIG_KVM_GUEST
>> obj-$(CONFIG_INTEL_TDX_GUEST) += tdx-kvm.o
>> #endif
> 
> Is there truly no dependency between CONFIG_KVM_GUEST and
> CONFIG_INTEL_TDX_GUEST?

We want to re-use the TDX code with other hypervisors/guests as well, so
we can't create a direct dependency on CONFIG_KVM_GUEST in Kconfig.

> 
> If there isn't, then the way we do it is adding another (invisible)
> Kconfig variable to express the dependency for tdx-kvm.o:
> 
> config INTEL_TDX_GUEST_KVM
> 	bool
> 	depends on KVM_GUEST && INTEL_TDX_GUEST

Currently it will only be used for the KVM hypercall code. Would it be
overkill to create a new config option instead of #ifdefs for this use
case? But if this is the preferred approach, I will go with this suggestion.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 10/32] x86/tdx: Wire up KVM hypercalls
  2021-05-13 19:29             ` Kuppuswamy, Sathyanarayanan
@ 2021-05-13 19:33               ` Dave Hansen
  2021-05-18  0:15                 ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-13 19:33 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Kirill A. Shutemov
  Cc: Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel, Isaku Yamahata

On 5/13/21 12:29 PM, Kuppuswamy, Sathyanarayanan wrote:
>> If there isn't, then the way we do it is adding another (invisible)
>> Kconfig variable to express the dependency for tdx-kvm.o:
>>
>> config INTEL_TDX_GUEST_KVM
>>     bool
>>     depends on KVM_GUEST && INTEL_TDX_GUEST
> 
> Currently it will only be used for the KVM hypercall code. Would it be
> overkill to create a new config option instead of #ifdefs for this use
> case? But if this is the preferred approach, I will go with this suggestion.

You'll see this done lots of different (valid) ways over the kernel.
(#ifdef'd #including C files is not one of them.)

*My* preference is to use Kconfig in the way I described.  It keeps
makefiles and #ifdef's clean and obvious, relegating the logic to Kconfig.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-13 17:49                 ` Dave Hansen
  2021-05-13 18:17                   ` Kuppuswamy, Sathyanarayanan
@ 2021-05-13 19:38                   ` Andi Kleen
  2021-05-13 19:42                     ` Dave Hansen
  2021-05-17 18:16                     ` Sean Christopherson
  1 sibling, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-13 19:38 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy, Sathyanarayanan, Sean Christopherson
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel


On 5/13/2021 10:49 AM, Dave Hansen wrote:
> On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
>> +#define PROTECTED_GUEST_BITMAP_LEN    128
>> +
>> +/* Protected Guest vendor types */
>> +#define GUEST_TYPE_TDX            (1)
>> +#define GUEST_TYPE_SEV            (2)
>> +
>> +/* Protected Guest features */
>> +#define MEMORY_ENCRYPTION        (20)
> I was assuming we'd reuse the X86_FEATURE infrastructure somehow.  Is
> there a good reason not to?


This is for generic code. It would be a gigantic lift and lots of
refactoring to move that out.

>
> That gives us all the compile-time optimization (via
> en/disabled-features.h) and static branches for "free".

There's no user so far which is anywhere near performance critical, so 
that would be total overkill.

BTW right now I'm not even sure we need the bitmap for anything, but I 
guess it doesn't hurt.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-13 19:38                   ` Andi Kleen
@ 2021-05-13 19:42                     ` Dave Hansen
  2021-05-17 18:16                     ` Sean Christopherson
  1 sibling, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-13 19:42 UTC (permalink / raw)
  To: Andi Kleen, Kuppuswamy, Sathyanarayanan, Sean Christopherson
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel

On 5/13/21 12:38 PM, Andi Kleen wrote:
> 
> On 5/13/2021 10:49 AM, Dave Hansen wrote:
>> On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
>>> +#define PROTECTED_GUEST_BITMAP_LEN    128
>>> +
>>> +/* Protected Guest vendor types */
>>> +#define GUEST_TYPE_TDX            (1)
>>> +#define GUEST_TYPE_SEV            (2)
>>> +
>>> +/* Protected Guest features */
>>> +#define MEMORY_ENCRYPTION        (20)
>> I was assuming we'd reuse the X86_FEATURE infrastructure somehow.  Is
>> there a good reason not to?
> 
> This is for generic code. It would be a gigantic lift and lots of
> refactoring to move that out.

Ahh, forgot about that.  The whole "x86/mm" subject threw me off.

>> That gives us all the compile-time optimization (via
>> en/disabled-features.h) and static branches for "free".
> 
> There's no user so far which is anywhere near performance critical, so
> that would be total overkill.

The *REALLY* nice thing is that it keeps you from having to create stub
functions or #ifdefs and yet the compiler can still optimize the code to
nothing.
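
As a rough user-space sketch of that idea (not the kernel's actual
cpu_feature_enabled() implementation): a build-time mask of disabled
features lets a check on a constant feature bit fold to a constant false,
so the guarded code can be discarded without stubs or #ifdefs at the call
site. All names here (FEAT_*, BUILD_WITH_TDX_GUEST, DISABLED_MASK,
feature_enabled(), runtime_caps) are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical feature bits, for illustration only. */
#define FEAT_TDX_GUEST	0
#define FEAT_SEV_GUEST	1

/* Hypothetical build-time switch standing in for a Kconfig option. */
#ifndef BUILD_WITH_TDX_GUEST
#define BUILD_WITH_TDX_GUEST 0
#endif

/* Bits set here can never be true in this build. */
#define DISABLED_MASK	(BUILD_WITH_TDX_GUEST ? 0u : (1u << FEAT_TDX_GUEST))

/* Capabilities detected at run time; a real system fills this in at boot. */
static unsigned int runtime_caps;

static inline bool runtime_has(unsigned int bit)
{
	return (runtime_caps >> bit) & 1u;
}

/*
 * For a constant 'bit' that is set in DISABLED_MASK the expression is the
 * constant false, so the compiler is free to drop the guarded code.
 */
#define feature_enabled(bit) \
	(((DISABLED_MASK >> (bit)) & 1u) ? false : runtime_has(bit))

static int tdx_only_work_done;

static void maybe_do_tdx_work(void)
{
	if (feature_enabled(FEAT_TDX_GUEST))
		tdx_only_work_done++;	/* dead code when TDX is compiled out */
}
```

With BUILD_WITH_TDX_GUEST left at 0, maybe_do_tdx_work() does nothing even
if the run-time bit happens to be set, while the FEAT_SEV_GUEST check still
falls through to the run-time test.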

Anyway, thanks for the clarification about it being in non-arch code.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-05-07 21:36   ` Dave Hansen
@ 2021-05-13 19:47     ` Andi Kleen
  2021-05-13 20:07       ` Dave Hansen
  2021-05-13 20:14       ` Dave Hansen
  2021-05-18  0:09     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
  1 sibling, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-13 19:47 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson


On 5/7/2021 2:36 PM, Dave Hansen wrote:
> On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> ...
>> The #VE cannot be nested before TDGETVEINFO is called, if there is any
>> reason for it to nest the TD would shut down. The TDX module guarantees
>> that no NMIs (or #MC or similar) can happen in this window. After
>> TDGETVEINFO the #VE handler can nest if needed, although we don’t expect
>> it to happen normally.
> I think this description really needs some work.  Does "The #VE cannot
> be nested" mean that "hardware guarantees that #VE will not be
> generated", or "the #VE must not be nested"?

The next half sentence answers this question:

"if there is any reason for it to nest the TD would shut down."

So it cannot nest.


>
> What does "the TD would shut down" mean?  I think you mean that instead
> of delivering a nested #VE the hardware would actually exit to the host
> and TDX would prevent the guest from being reentered.  Right?


Yes, that's a shutdown. I suppose we could add your sentence.


> I find that description a bit unsatisfying.  Could we make this a bit
> more concrete?


I don't see what could be added. If you have concrete suggestions please 
just propose something.


>   By the way, what about *normal* interrupts?


Normal interrupts are blocked, of course, as in every other exception or
interrupt entry.

>
> Maybe we should talk about this in terms of *rules* that folks need to
> follow.  Maybe:
>
> 	NMIs and machine checks are suppressed.  Before this point any
> 	#VE is fatal.  After this point, NMIs and additional #VEs are
> 	permitted.

Okay that's fine for me.


-Andi




^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-05-13 19:47     ` Andi Kleen
@ 2021-05-13 20:07       ` Dave Hansen
  2021-05-13 22:43         ` Andi Kleen
  2021-05-13 20:14       ` Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-13 20:07 UTC (permalink / raw)
  To: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson

On 5/13/21 12:47 PM, Andi Kleen wrote:
> "if there is any reason for it to nest the TD would shut down."

The TDX EAS says:

> If, when attempting to inject a #VE, the Intel TDX module discovers
> that the guest TD has not yet retrieved the information for a
> previous #VE (i.e., VE_INFO.VALID is not 0), the TDX module injects a
> #DF into the guest TD to indicate a #VE overrun.

How does that result in a shut down?
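
The overrun rule quoted above can be modeled as a tiny state machine. This
is only a sketch under stated assumptions: struct ve_info_model,
model_inject_ve() and model_get_ve_info() are hypothetical names, and the
model assumes that retrieving the info (TDGETVEINFO) is what clears the
VALID field, as the EAS wording suggests.

```c
#include <stdbool.h>

enum inject_result { INJECT_VE, INJECT_DF };

/* Hypothetical model of the #VE info area; only 'valid' drives the logic. */
struct ve_info_model {
	bool valid;
	unsigned long exit_reason;
};

/* What the TDX module would do when it wants to deliver a #VE. */
static enum inject_result model_inject_ve(struct ve_info_model *ve,
					  unsigned long reason)
{
	if (ve->valid)
		return INJECT_DF;	/* previous #VE not yet retrieved: overrun */

	ve->valid = true;
	ve->exit_reason = reason;
	return INJECT_VE;
}

/* What the guest does via TDGETVEINFO: read the info, mark it consumed. */
static unsigned long model_get_ve_info(struct ve_info_model *ve)
{
	unsigned long reason = ve->exit_reason;

	ve->valid = false;
	return reason;
}
```

In this model a second model_inject_ve() before model_get_ve_info() yields
INJECT_DF, and the same injection succeeds once the info has been
retrieved.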

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-05-13 19:47     ` Andi Kleen
  2021-05-13 20:07       ` Dave Hansen
@ 2021-05-13 20:14       ` Dave Hansen
  1 sibling, 0 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-13 20:14 UTC (permalink / raw)
  To: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson

On 5/13/21 12:47 PM, Andi Kleen wrote:
> I don't see what could be added. If you have concrete suggestions please
> just propose something.

Oh, boy, I love writing changelogs!  I was hoping that the TDX folks
would chip in to write their own changelogs, but oh well.  You made my day!

--

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either userspace or the kernel:

 * Specific instructions (WBINVD, for example)
 * Specific MSR accesses
 * Specific CPUID leaf accesses
 * Access to TD-shared memory, which includes MMIO

#VE exceptions are never generated on accesses to normal, TD-private memory.

The entry paths do not access TD-shared memory or use those specific
MSRs, instructions, CPUID leaves.  In addition, all interrupts including
NMIs are blocked by the hardware starting with #VE delivery until
TDGETVEINFO is called.  This eliminates the chance of a #VE during the
syscall gap or paranoid entry paths and simplifies #VE handling.

If a guest kernel action which would normally cause a #VE occurs in the
interrupt-disabled region before TDGETVEINFO, a #DF is delivered to the
guest.

Add basic infrastructure to handle any #VE which occurs in the kernel or
userspace.  Later patches will add handling for specific #VE scenarios.

Convert unhandled #VE's (everything, until later in this series) so that
they appear just like a #GP by calling do_general_protection() directly.

--

Did I miss anything?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 08/32] x86/traps: Add #VE support for TDX guest
  2021-05-13 20:07       ` Dave Hansen
@ 2021-05-13 22:43         ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-13 22:43 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Dan Williams, Tony Luck
  Cc: Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson


On 5/13/2021 1:07 PM, Dave Hansen wrote:
> On 5/13/21 12:47 PM, Andi Kleen wrote:
>> "if there is any reason for it to nest the TD would shut down."
> The TDX EAS says:
>
>> If, when attempting to inject a #VE, the Intel TDX module discovers
>> that the guest TD has not yet retrieved the information for a
>> previous #VE (i.e., VE_INFO.VALID is not 0), the TDX module injects a
>> #DF into the guest TD to indicate a #VE overrun.
> How does that result in a shut down?


You're right. It's not a shutdown, but a panic. We'll need to fix the 
comment and replace 'shutdown' with 'panic'.


-Andi






^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-13 19:38                   ` Andi Kleen
  2021-05-13 19:42                     ` Dave Hansen
@ 2021-05-17 18:16                     ` Sean Christopherson
  2021-05-17 18:27                       ` Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-17 18:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kuppuswamy, Sathyanarayanan, Kirill A. Shutemov,
	Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel

On Thu, May 13, 2021, Andi Kleen wrote:
> 
> On 5/13/2021 10:49 AM, Dave Hansen wrote:
> > On 5/13/21 9:40 AM, Kuppuswamy, Sathyanarayanan wrote:
> > > +#define PROTECTED_GUEST_BITMAP_LEN    128
> > > +
> > > +/* Protected Guest vendor types */
> > > +#define GUEST_TYPE_TDX            (1)
> > > +#define GUEST_TYPE_SEV            (2)
> > > +
> > > +/* Protected Guest features */
> > > +#define MEMORY_ENCRYPTION        (20)
> > I was assuming we'd reuse the X86_FEATURE infrastructure somehow.  Is
> > there a good reason not to?
> 
> This is for generic code. It would be a gigantic lift and lots of
> refactoring to move that out.

What generic code needs access to SEV vs. TDX?  force_dma_unencrypted() is called
from generic code, but its implementation is x86 specific.

> > That gives us all the compile-time optimization (via
> > en/disabled-features.h) and static branches for "free".
> 
> There's no user so far which is anywhere near performance critical, so that
> would be total overkill.

SEV already has the sev_enable_key static key that it uses for unrolling string
I/O, so there's at least one (debatable) case that wants to use static branches.

For SEV-ES and TDX, there's a better argument as using X86_FEATURE_* would unlock
alternatives.

> BTW right now I'm not even sure we need the bitmap for anything, but I guess
> it doesn't hurt.
> 
> -Andi
> 
> 

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-17 18:16                     ` Sean Christopherson
@ 2021-05-17 18:27                       ` Kuppuswamy, Sathyanarayanan
  2021-05-17 18:33                         ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-17 18:27 UTC (permalink / raw)
  To: Sean Christopherson, Andi Kleen
  Cc: Dave Hansen, Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel



On 5/17/21 11:16 AM, Sean Christopherson wrote:
> What generic code needs access to SEV vs. TDX?  force_dma_unencrypted() is called
> from generic code, but its implementation is x86 specific.

When hardening the drivers for TDX usage, we will need to check
is_protected_guest() to add code specific to protected guests. Since this will
be outside arch/x86, we need a common framework for it.

A few examples:
  * The ACPI sleep driver uses WBINVD (when doing cache flushes). We want to
    skip it for TDX.
  * Forcing virtio to use the DMA API when running with an untrusted host.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-17 18:27                       ` Kuppuswamy, Sathyanarayanan
@ 2021-05-17 18:33                         ` Dave Hansen
  2021-05-17 18:37                           ` Sean Christopherson
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-17 18:33 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Sean Christopherson, Andi Kleen
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel

On 5/17/21 11:27 AM, Kuppuswamy, Sathyanarayanan wrote:
> On 5/17/21 11:16 AM, Sean Christopherson wrote:
>> What generic code needs access to SEV vs. TDX? 
>> force_dma_unencrypted() is called from generic code, but its
>> implementation is x86 specific.
> 
> When hardening the drivers for TDX usage, we will need to check
> is_protected_guest() to add code specific to protected guests.
> Since this will be outside arch/x86, we need a common framework
> for it.

Just remember, a "common framework" doesn't mean that it can't be backed
by extremely arch-specific mechanisms.

For instance, there's a lot of pkey-specific code in mm/mprotect.c.  It
still gets optimized away on x86 with all the goodness of X86_FEATUREs.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-17 18:33                         ` Dave Hansen
@ 2021-05-17 18:37                           ` Sean Christopherson
  2021-05-17 22:32                             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-17 18:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy, Sathyanarayanan, Andi Kleen, Kirill A. Shutemov,
	Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Raj Ashok,
	linux-kernel

On Mon, May 17, 2021, Dave Hansen wrote:
> On 5/17/21 11:27 AM, Kuppuswamy, Sathyanarayanan wrote:
> > On 5/17/21 11:16 AM, Sean Christopherson wrote:
> >> What generic code needs access to SEV vs. TDX? 
> >> force_dma_unencrypted() is called from generic code, but its
> >> implementation is x86 specific.
> > 
> > When hardening the drivers for TDX usage, we will need to check
> > is_protected_guest() to add code specific to protected guests.
> > Since this will be outside arch/x86, we need a common framework
> > for it.
> 
> Just remember, a "common framework" doesn't mean that it can't be backed
> by extremely arch-specific mechanisms.
> 
> For instance, there's a lot of pkey-specific code in mm/mprotect.c.  It
> still gets optimized away on x86 with all the goodness of X86_FEATUREs.

Ya, exactly.  Ideally, generic code shouldn't have to differentiate between SEV,
SEV-ES, SEV-SNP, TDX, etc..., a vanilla "bool is_protected_guest(void)" should
suffice.  Under the hood, x86's implementation for is_protected_guest() can be
boot_cpu_has() checks (if we want).
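
A minimal user-space sketch of that split, assuming hypothetical stand-ins
for the arch side (sketch_boot_cpu_has(), SKETCH_FEATURE_*) and a
hypothetical generic caller modeled on the ACPI sleep example mentioned
earlier in the thread. It only illustrates the layering and is not
proposed kernel code.

```c
#include <stdbool.h>

/* "arch" side: hypothetical stand-ins for boot_cpu_has() and feature bits. */
enum {
	SKETCH_FEATURE_TDX_GUEST,
	SKETCH_FEATURE_SEV_GUEST,
	SKETCH_NR_FEATURES
};

static bool sketch_boot_caps[SKETCH_NR_FEATURES];

static inline bool sketch_boot_cpu_has(int feature)
{
	return sketch_boot_caps[feature];
}

/* Generic interface: a plain bool, with no vendor detail exposed. */
static inline bool is_protected_guest(void)
{
	return sketch_boot_cpu_has(SKETCH_FEATURE_TDX_GUEST) ||
	       sketch_boot_cpu_has(SKETCH_FEATURE_SEV_GUEST);
}

/* Generic caller: hypothetical driver path that normally flushes caches. */
static int cache_flushes_issued;

static void sketch_driver_prepare_sleep(void)
{
	if (is_protected_guest())
		return;			/* skip the flush in a protected guest */

	cache_flushes_issued++;		/* stands in for the real cache flush */
}
```

The caller never names TDX or SEV; only the arch-side helper knows which
feature bits make a guest "protected".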

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-17 18:37                           ` Sean Christopherson
@ 2021-05-17 22:32                             ` Kuppuswamy, Sathyanarayanan
  2021-05-17 23:11                               ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-17 22:32 UTC (permalink / raw)
  To: Sean Christopherson, Dave Hansen
  Cc: Andi Kleen, Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel



On 5/17/21 11:37 AM, Sean Christopherson wrote:
>> Just remember, a "common framework" doesn't mean that it can't be backed
>> by extremely arch-specific mechanisms.
>>
>> For instance, there's a lot of pkey-specific code in mm/mprotect.c.  It
>> still gets optimized away on x86 with all the goodness of X86_FEATUREs.
> Ya, exactly.  Ideally, generic code shouldn't have to differentiate between SEV,
> SEV-ES, SEV-SNP, TDX, etc..., a vanilla "bool is_protected_guest(void)" should
> suffice.  Under the hood, x86's implementation for is_protected_guest() can be
> boot_cpu_has() checks (if we want).

What about the use case of protected_guest_has(flag)? Do you want to call it
with X86_FEATURE_* flags outside arch/x86 code?


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-17 22:32                             ` Kuppuswamy, Sathyanarayanan
@ 2021-05-17 23:11                               ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-17 23:11 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Sean Christopherson, Dave Hansen
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andy Lutomirski,
	Dan Williams, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Raj Ashok, linux-kernel


On 5/17/2021 3:32 PM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 5/17/21 11:37 AM, Sean Christopherson wrote:
>>> Just remember, a "common framework" doesn't mean that it can't be 
>>> backed
>>> by extremely arch-specific mechanisms.
>>>
>>> For instance, there's a lot of pkey-specific code in mm/mprotect.c.  It
>>> still gets optimized away on x86 with all the goodness of X86_FEATUREs.
>> Ya, exactly.  Ideally, generic code shouldn't have to differentiate 
>> between SEV,
>> SEV-ES, SEV-SNP, TDX, etc..., a vanilla "bool 
>> is_protected_guest(void)" should
>> suffice.  Under the hood, x86's implementation for 
>> is_protected_guest() can be
>> boot_cpu_has() checks (if we want).
>
> What about the use case of protected_guest_has(flag)? Do you want to 
> call it with X86_FEATURE_* flags outside arch/x86 code?


I don't think we need any flags in the generic code. Just a simple bool 
is enough.


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT
  2021-05-12 13:51               ` Juergen Gross
@ 2021-05-17 23:50                 ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-17 23:50 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Borislav Petkov
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

CONFIG_PARAVIRT_XXL is mainly defined/used by Xen PV guests. For
other VM guest types, the features supported under CONFIG_PARAVIRT
are self-sufficient. CONFIG_PARAVIRT mainly provides support for
TLB flush operations and time-related operations.

For a TDX guest as well, the paravirt calls under CONFIG_PARAVIRT meet
most of its requirements, except for the HLT and SAFE_HLT
paravirt calls, which are currently defined under
CONFIG_PARAVIRT_XXL.

Since enabling CONFIG_PARAVIRT_XXL would add too much bloat for
TDX-guest-like platforms, move the HLT and SAFE_HLT paravirt calls
under CONFIG_PARAVIRT.

Moving the HLT and SAFE_HLT paravirt calls is not fatal and should not
break any functionality for current users of CONFIG_PARAVIRT.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---

Changes since v1:
 * Removed CONFIG_PARAVIRT_XL
 * Moved HLT and SAFE_HLT under CONFIG_PARAVIRT

 arch/x86/include/asm/irqflags.h       | 40 +++++++++++++++------------
 arch/x86/include/asm/paravirt.h       | 20 +++++++-------
 arch/x86/include/asm/paravirt_types.h |  3 +-
 arch/x86/kernel/paravirt.c            |  4 ++-
 4 files changed, 36 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 144d70ea4393..6671744dbf3c 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -59,6 +59,28 @@ static inline __cpuidle void native_halt(void)
 
 #endif
 
+#ifndef CONFIG_PARAVIRT
+#ifndef __ASSEMBLY__
+/*
+ * Used in the idle loop; sti takes one instruction cycle
+ * to complete:
+ */
+static inline __cpuidle void arch_safe_halt(void)
+{
+	native_safe_halt();
+}
+
+/*
+ * Used when interrupts are already enabled or to
+ * shutdown the processor:
+ */
+static inline __cpuidle void halt(void)
+{
+	native_halt();
+}
+#endif /* __ASSEMBLY__ */
+#endif /* CONFIG_PARAVIRT */
+
 #ifdef CONFIG_PARAVIRT_XXL
 #include <asm/paravirt.h>
 #else
@@ -80,24 +102,6 @@ static __always_inline void arch_local_irq_enable(void)
 	native_irq_enable();
 }
 
-/*
- * Used in the idle loop; sti takes one instruction cycle
- * to complete:
- */
-static inline __cpuidle void arch_safe_halt(void)
-{
-	native_safe_halt();
-}
-
-/*
- * Used when interrupts are already enabled or to
- * shutdown the processor:
- */
-static inline __cpuidle void halt(void)
-{
-	native_halt();
-}
-
 /*
  * For spinlocks, etc:
  */
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110e2243..5d967bce8937 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,16 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 	PVOP_VCALL1(mmu.exit_mmap, mm);
 }
 
+static inline void arch_safe_halt(void)
+{
+	PVOP_VCALL0(irq.safe_halt);
+}
+
+static inline void halt(void)
+{
+	PVOP_VCALL0(irq.halt);
+}
+
 #ifdef CONFIG_PARAVIRT_XXL
 static inline void load_sp0(unsigned long sp0)
 {
@@ -145,16 +155,6 @@ static inline void __write_cr4(unsigned long x)
 	PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-static inline void arch_safe_halt(void)
-{
-	PVOP_VCALL0(irq.safe_halt);
-}
-
-static inline void halt(void)
-{
-	PVOP_VCALL0(irq.halt);
-}
-
 static inline void wbinvd(void)
 {
 	PVOP_VCALL0(cpu.wbinvd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index de87087d3bde..68bf35ce6dd5 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -177,10 +177,9 @@ struct pv_irq_ops {
 	struct paravirt_callee_save save_fl;
 	struct paravirt_callee_save irq_disable;
 	struct paravirt_callee_save irq_enable;
-
+#endif
 	void (*safe_halt)(void);
 	void (*halt)(void);
-#endif
 } __no_randomize_layout;
 
 struct pv_mmu_ops {
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c60222ab8ab9..b001f5aaee4a 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -322,9 +322,11 @@ struct paravirt_patch_template pv_ops = {
 	.irq.save_fl		= __PV_IS_CALLEE_SAVE(native_save_fl),
 	.irq.irq_disable	= __PV_IS_CALLEE_SAVE(native_irq_disable),
 	.irq.irq_enable		= __PV_IS_CALLEE_SAVE(native_irq_enable),
+#endif /* CONFIG_PARAVIRT_XXL */
+
+	/* Irq HLT ops. */
 	.irq.safe_halt		= native_safe_halt,
 	.irq.halt		= native_halt,
-#endif /* CONFIG_PARAVIRT_XXL */
 
 	/* Mmu ops. */
 	.mmu.flush_tlb_user	= native_flush_tlb_local,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-07 21:36   ` Dave Hansen
  2021-05-13 19:47     ` Andi Kleen
@ 2021-05-18  0:09     ` Kuppuswamy Sathyanarayanan
  2021-05-18 15:11       ` Dave Hansen
  2021-05-21 18:45       ` [RFC v2-fix " Kuppuswamy, Sathyanarayanan
  1 sibling, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18  0:09 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the kernel:

 * Specific instructions (WBINVD, for example)
 * Specific MSR accesses
 * Specific CPUID leaf accesses
 * Access to TD-shared memory, which includes MMIO

In the settings that Linux will run in, virtualization exceptions are
never generated on accesses to normal, TD-private memory that has been
accepted.

The entry paths do not access TD-shared memory, MMIO regions or use
those specific MSRs, instructions, CPUID leaves that might generate #VE.
In addition, all interrupts including NMIs are blocked by the hardware
starting with #VE delivery until TDGETVEINFO is called.  This eliminates
the chance of a #VE during the syscall gap or paranoid entry paths and
simplifies #VE handling.

After TDGETVEINFO, a #VE could happen in theory (e.g. through an NMI),
although we don't expect it to happen because we don't expect NMIs to
trigger #VEs. Another case where one could happen is if the #VE
handler panics, but in that case there are no guarantees about anything
anyway.

If a guest kernel action which would normally cause a #VE occurs in the
interrupt-disabled region before TDGETVEINFO, a #DF is delivered to the
guest which will result in an oops (and should eventually be a panic, as
we would like to set panic_on_oops to 1 for TDX guests).

Add basic infrastructure to handle any #VE which occurs in the kernel or
userspace.  Later patches will add handling for specific #VE scenarios.

Convert unhandled #VE's (everything, until later in this series) so that
they appear just like a #GP by calling ve_raise_fault() directly.
ve_raise_fault() is similar to the #GP handler: it sends SIGSEGV to
userspace, and for kernel-mode faults it notifies debuggers and other
die-chain users before calling die_addr().

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since v1:
 * Removed [RFC v2 07/32] x86/traps: Add do_general_protection() helper function.
 * Instead of reusing the #GP handler, defined a custom handler.
 * Fixed commit log as per review comments.
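
Illustration only, not part of the patch: tdg_get_ve_info() below takes
the instruction length from the low 32 bits of R10 and the instruction
info from the high 32 bits. A minimal user-space sketch of that split,
assuming nothing beyond what the diff shows (the struct and function
names here are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two R10-derived fields of struct ve_info. */
struct ve_len_info {
	uint32_t instr_len;
	uint32_t instr_info;
};

/* Split a raw R10 value: low 32 bits = length, high 32 bits = info. */
static struct ve_len_info split_r10(uint64_t r10)
{
	struct ve_len_info v;

	v.instr_len  = (uint32_t)(r10 & 0xffffffffu);
	v.instr_info = (uint32_t)(r10 >> 32);
	return v;
}
```

The mask plays the same role as "& UINT_MAX" in the patch; the shift
needs no mask because the destination field is only 32 bits wide.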

 arch/x86/include/asm/idtentry.h |  4 ++
 arch/x86/include/asm/tdx.h      | 20 ++++++++++
 arch/x86/kernel/idt.c           |  6 +++
 arch/x86/kernel/tdx.c           | 35 +++++++++++++++++
 arch/x86/kernel/traps.c         | 70 +++++++++++++++++++++++++++++++++
 5 files changed, 135 insertions(+)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5eb3bdf36a41..41a0732d5f68 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -619,6 +619,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER,	exc_xen_hypervisor_callback);
 DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER,	exc_xen_unknown_trap);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE,		exc_virtualization_exception);
+#endif
+
 /* Device interrupts common/spurious */
 DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 1d75be21a09b..8ab4067afefc 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -11,6 +11,7 @@
 #include <linux/types.h>
 
 #define TDINFO			1
+#define TDGETVEINFO		3
 
 struct tdx_module_output {
 	u64 rcx;
@@ -29,6 +30,25 @@ struct tdx_hypercall_output {
 	u64 r15;
 };
 
+/*
+ * Used by #VE exception handler to gather the #VE exception
+ * info from the TDX module. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct ve_info {
+	u64 exit_reason;
+	u64 exit_qual;
+	u64 gla;
+	u64 gpa;
+	u32 instr_len;
+	u32 instr_info;
+};
+
+unsigned long tdg_get_ve_info(struct ve_info *ve);
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+		struct ve_info *ve);
+
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..546b6b636c7d 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -64,6 +64,9 @@ static const __initconst struct idt_data early_idts[] = {
 	 */
 	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 #endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 };
 
 /*
@@ -87,6 +90,9 @@ static const __initconst struct idt_data def_idts[] = {
 	INTG(X86_TRAP_MF,		asm_exc_coprocessor_error),
 	INTG(X86_TRAP_AC,		asm_exc_alignment_check),
 	INTG(X86_TRAP_XF,		asm_exc_simd_coprocessor_error),
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 
 #ifdef CONFIG_X86_32
 	TSKG(X86_TRAP_DF,		GDT_ENTRY_DOUBLEFAULT_TSS),
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 4dfacde05f0c..b5fffbd86331 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -85,6 +85,41 @@ static void tdg_get_info(void)
 	td_info.attributes = out.rdx;
 }
 
+unsigned long tdg_get_ve_info(struct ve_info *ve)
+{
+	u64 ret;
+	struct tdx_module_output out = {0};
+
+	/*
+	 * NMIs and machine checks are suppressed. Before this point any
+	 * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
+	 * additional #VEs are permitted (but we don't expect them to
+	 * happen unless you panic).
+	 */
+	ret = __tdx_module_call(TDGETVEINFO, 0, 0, 0, 0, &out);
+
+	ve->exit_reason = out.rcx;
+	ve->exit_qual   = out.rdx;
+	ve->gla         = out.r8;
+	ve->gpa         = out.r9;
+	ve->instr_len   = out.r10 & UINT_MAX;
+	ve->instr_info  = out.r10 >> 32;
+
+	return ret;
+}
+
+int tdg_handle_virtualization_exception(struct pt_regs *regs,
+		struct ve_info *ve)
+{
+	/*
+	 * TODO: Add handler support for various #VE exit
+	 * reasons. It will be added by other patches in
+	 * the series.
+	 */
+	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+	return -EFAULT;
+}
+
 void __init tdx_early_init(void)
 {
 	if (!cpuid_has_tdx_guest())
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 651e3e508959..af8efa2e57ba 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/vdso.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
@@ -1137,6 +1138,75 @@ DEFINE_IDTENTRY(exc_device_not_available)
 	}
 }
 
+#define VEFSTR "VE fault"
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+	struct task_struct *tsk = current;
+
+	if (user_mode(regs)) {
+		tsk->thread.error_code = error_code;
+		tsk->thread.trap_nr = X86_TRAP_VE;
+
+		/*
+		 * Not fixing up VDSO exceptions similar to #GP handler
+		 * because we don't expect the VDSO to trigger #VE.
+		 */
+		show_signal(tsk, SIGSEGV, "", VEFSTR, regs, error_code);
+		force_sig(SIGSEGV);
+		return;
+	}
+
+
+	if (fixup_exception(regs, X86_TRAP_VE, error_code, 0))
+		return;
+
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_VE;
+
+	/*
+	 * To be potentially processing a kprobe fault and to trust the result
+	 * from kprobe_running(), we have to be non-preemptible.
+	 */
+	if (!preemptible() &&
+	    kprobe_running() &&
+	    kprobe_fault_handler(regs, X86_TRAP_VE))
+		return;
+
+	notify_die(DIE_GPF, VEFSTR, regs, error_code, X86_TRAP_VE, SIGSEGV);
+
+	die_addr(VEFSTR, regs, error_code, 0);
+}
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+	struct ve_info ve;
+	int ret;
+
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
+	/*
+	 * NMIs/Machine-checks/Interrupts will be in a disabled state
+	 * till TDGETVEINFO TDCALL is executed. This prevents #VE
+	 * nesting issue.
+	 */
+	ret = tdg_get_ve_info(&ve);
+
+	cond_local_irq_enable(regs);
+
+	if (!ret)
+		ret = tdg_handle_virtualization_exception(regs, &ve);
+	/*
+	 * If tdg_handle_virtualization_exception() could not process
+	 * it successfully, treat it as #GP(0) and handle it.
+	 */
+	if (ret)
+		ve_raise_fault(regs, 0);
+
+	cond_local_irq_disable(regs);
+}
+#endif
+
 #ifdef CONFIG_X86_32
 DEFINE_IDTENTRY_SW(iret_error)
 {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-13 19:33               ` Dave Hansen
@ 2021-05-18  0:15                 ` Kuppuswamy Sathyanarayanan
  2021-05-18 15:51                   ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18  0:15 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Isaku Yamahata,
	Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

KVM hypercalls use the "vmcall" or "vmmcall" instructions.
Although the ABI is similar, those instructions no longer
function for TDX guests. Make vendor-specific TDVMCALLs
instead of VMCALL.

[Isaku: proposed KVM VENDOR string]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
Changes since RFC v2:
 * Introduced INTEL_TDX_GUEST_KVM config for TDX+KVM related changes.
 * Removed "C" include file.
 * Fixed commit log as per Dave's comments.
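
Illustration only, not part of the patch: the TDVMCALL_VENDOR_KVM
constant added to tdcall.S is annotated as "TDX.KVM". Reading its bytes
from least to most significant gives that string, which the following
user-space sketch checks (vendor_id_to_str() is a hypothetical helper):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TDVMCALL_VENDOR_KVM 0x4d564b2e584454ULL

/* Write the ID's bytes, least significant first, as a C string. */
static void vendor_id_to_str(uint64_t id, char out[9])
{
	for (int i = 0; i < 8; i++)
		out[i] = (char)((id >> (8 * i)) & 0xff);
	out[8] = '\0';
}
```

Shifts are used rather than memcpy() so the result does not depend on
the byte order of the machine running the check.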

 arch/x86/Kconfig                |  6 +++++
 arch/x86/include/asm/kvm_para.h | 21 +++++++++++++++
 arch/x86/include/asm/tdx.h      | 41 ++++++++++++++++++++++++++++
 arch/x86/kernel/Makefile        |  1 +
 arch/x86/kernel/tdcall.S        | 20 ++++++++++++++
 arch/x86/kernel/tdx-kvm.c       | 48 +++++++++++++++++++++++++++++++++
 6 files changed, 137 insertions(+)
 create mode 100644 arch/x86/kernel/tdx-kvm.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9e0e0ff76bab..768df1b98487 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,6 +886,12 @@ config INTEL_TDX_GUEST
 	  run in a CPU mode that protects the confidentiality of TD memory
 	  contents and the TD’s CPU state from other software, including VMM.
 
+config INTEL_TDX_GUEST_KVM
+	def_bool y
+	depends on KVM_GUEST && INTEL_TDX_GUEST
+	help
+	 This option enables KVM specific hypercalls in TDX guest.
+
 endif #HYPERVISOR_GUEST
 
 source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
 #include <asm/alternative.h>
 #include <linux/interrupt.h>
 #include <uapi/asm/kvm_para.h>
+#include <asm/tdx.h>
 
 extern void kvmclock_init(void);
 
@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 static inline long kvm_hypercall0(unsigned int nr)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall0(nr);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall1(nr, p1);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
 				  unsigned long p2)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall2(nr, p1, p2);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
 				  unsigned long p2, unsigned long p3)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 				  unsigned long p4)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8ab4067afefc..eb758b506dba 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -73,4 +73,45 @@ static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
+			       u64 r15, struct tdx_hypercall_output *out);
+long tdx_kvm_hypercall0(unsigned int nr);
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1);
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2);
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3);
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4);
+#else
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+				      unsigned long p2)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3,
+				      unsigned long p4)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
+
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 7966c10ea8d1..a90fec004844 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -128,6 +128,7 @@ obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
 obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index a484c4aef6e6..3c57a1d67b79 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -25,6 +25,8 @@
 					  TDG_R12 | TDG_R13 | \
 					  TDG_R14 | TDG_R15 )
 
+#define TDVMCALL_VENDOR_KVM		0x4d564b2e584454 /* "TDX.KVM" */
+
 /*
  * TDX guests use the TDCALL instruction to make requests to the
  * TDX module and hypercalls to the VMM. It is supported in
@@ -213,3 +215,21 @@ SYM_FUNC_START(__tdx_hypercall)
 	call do_tdx_hypercall
 	retq
 SYM_FUNC_END(__tdx_hypercall)
+
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+/*
+ * Helper function for KVM vendor TDVMCALLs. This assembly wrapper
+ * lets us reuse do_tdx_hypercall() for KVM-specific hypercalls
+ * (TDVMCALL_VENDOR_KVM).
+ */
+SYM_FUNC_START(__tdx_hypercall_vendor_kvm)
+	/*
+	 * R10 is not part of the function call ABI, but it is a part
+	 * of the TDVMCALL ABI. So set it before making call to the
+	 * do_tdx_hypercall().
+	 */
+	movq $TDVMCALL_VENDOR_KVM, %r10
+	call do_tdx_hypercall
+	retq
+SYM_FUNC_END(__tdx_hypercall_vendor_kvm)
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
diff --git a/arch/x86/kernel/tdx-kvm.c b/arch/x86/kernel/tdx-kvm.c
new file mode 100644
index 000000000000..b21453a81e38
--- /dev/null
+++ b/arch/x86/kernel/tdx-kvm.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2020 Intel Corporation */
+
+#include <asm/tdx.h>
+
+static long tdx_kvm_hypercall(unsigned int fn, unsigned long r12,
+			      unsigned long r13, unsigned long r14,
+			      unsigned long r15)
+{
+	return __tdx_hypercall_vendor_kvm(fn, r12, r13, r14, r15, NULL);
+}
+
+/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall0);
+
+/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return tdx_kvm_hypercall(nr, p1, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall1);
+
+/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2)
+{
+	return tdx_kvm_hypercall(nr, p1, p2, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall2);
+
+/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3)
+{
+	return tdx_kvm_hypercall(nr, p1, p2, p3, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall3);
+
+/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4)
+{
+	return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-07 21:52   ` Dave Hansen
@ 2021-05-18  0:48     ` Kuppuswamy Sathyanarayanan
  2021-05-18 15:00       ` Dave Hansen
  2021-06-02 19:42     ` [RFC v2-fix-v2 0/2] " Kuppuswamy Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18  0:48 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

In traditional VMs, MMIO tends to be implemented by giving a
guest access to a mapping which will cause a VMEXIT on access.
That's not possible in a TDX guest, so use #VE to implement MMIO
support. In a TDX guest, MMIO triggers #VE with the EPT_VIOLATION
exit reason.

For now we only handle a subset of instructions that the kernel
uses for MMIO operations. User-space access triggers SIGBUS.

The reasons for supporting #VE-based MMIO in TDX guests are:

* MMIO is widely used and we'll have more drivers in the future.
* We don't want to annotate every TDX specific MMIO readl/writel etc.
* If we didn't annotate we would need to add an alternative to every
  MMIO access in the kernel (even though 99.9% will never be used on
  TDX) which would be a complete waste and incredible binary bloat
  for nothing.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
 * Fixed commit log as per Dave's review.
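
Illustration only, not part of the patch: the handler below picks the
register operand from ModRM.reg extended by REX.R, and forces a 1-byte
access size for the byte forms of MOV. A user-space sketch of just those
two decisions, assuming the standard x86 encoding (the function names
are hypothetical, and the real code relies on the kernel's instruction
decoder rather than raw bytes):

```c
#include <assert.h>
#include <stdint.h>

/* ModRM.reg is bits 5:3; REX.R (bit 2 of the REX byte) extends it to 4 bits. */
static int mmio_reg_index(uint8_t modrm, uint8_t rex)
{
	int regno = (modrm >> 3) & 0x7;

	if (rex & 0x4)
		regno += 8;
	return regno;	/* 0 = ax, 1 = cx, ..., 8 = r8, ..., 15 = r15 */
}

/* Byte-sized MOV forms override the decoder's default operand size. */
static int mmio_access_size(uint8_t opcode, int opnd_bytes)
{
	switch (opcode) {
	case 0x88:	/* MOV r/m8, r8 */
	case 0x8A:	/* MOV r8, r/m8 */
	case 0xC6:	/* MOV r/m8, imm8 */
		return 1;
	default:
		return opnd_bytes;
	}
}
```

The returned index is in the same order as the regoff[] table in
get_reg_ptr(), so index 1 selects cx and index 9 selects r9.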

 arch/x86/kernel/tdx.c | 100 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index b9e3010987e0..9330c7a9ad69 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -5,6 +5,8 @@
 
 #include <asm/tdx.h>
 #include <asm/vmx.h>
+#include <asm/insn.h>
+#include <linux/sched/signal.h> /* force_sig_fault() */
 
 #include <linux/cpu.h>
 #include <linux/protected_guest.h>
@@ -209,6 +211,101 @@ static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
 	}
 }
 
+static unsigned long tdg_mmio(int size, bool write, unsigned long addr,
+		unsigned long val)
+{
+	return tdx_hypercall_out_r11(EXIT_REASON_EPT_VIOLATION, size,
+				     write, addr, val);
+}
+
+static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn)
+{
+	static const int regoff[] = {
+		offsetof(struct pt_regs, ax),
+		offsetof(struct pt_regs, cx),
+		offsetof(struct pt_regs, dx),
+		offsetof(struct pt_regs, bx),
+		offsetof(struct pt_regs, sp),
+		offsetof(struct pt_regs, bp),
+		offsetof(struct pt_regs, si),
+		offsetof(struct pt_regs, di),
+		offsetof(struct pt_regs, r8),
+		offsetof(struct pt_regs, r9),
+		offsetof(struct pt_regs, r10),
+		offsetof(struct pt_regs, r11),
+		offsetof(struct pt_regs, r12),
+		offsetof(struct pt_regs, r13),
+		offsetof(struct pt_regs, r14),
+		offsetof(struct pt_regs, r15),
+	};
+	int regno;
+
+	regno = X86_MODRM_REG(insn->modrm.value);
+	if (X86_REX_R(insn->rex_prefix.value))
+		regno += 8;
+
+	return (void *)regs + regoff[regno];
+}
+
+static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+	int size;
+	bool write;
+	unsigned long *reg;
+	struct insn insn;
+	unsigned long val = 0;
+
+	/*
+	 * User mode would mean the kernel exposed a device directly
+	 * to ring3, which shouldn't happen except for things like
+	 * DPDK.
+	 */
+	if (user_mode(regs)) {
+		pr_err("Unexpected user-mode MMIO access.\n");
+		force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
+		return 0;
+	}
+
+	kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
+	insn_get_length(&insn);
+	insn_get_opcode(&insn);
+
+	write = ve->exit_qual & 0x2;
+
+	size = insn.opnd_bytes;
+	switch (insn.opcode.bytes[0]) {
+	/* MOV r/m8	r8	*/
+	case 0x88:
+	/* MOV r8	r/m8	*/
+	case 0x8A:
+	/* MOV r/m8	imm8	*/
+	case 0xC6:
+		size = 1;
+		break;
+	}
+
+	if (inat_has_immediate(insn.attr)) {
+		BUG_ON(!write);
+		val = insn.immediate.value;
+		tdg_mmio(size, write, ve->gpa, val);
+		return insn.length;
+	}
+
+	BUG_ON(!inat_has_modrm(insn.attr));
+
+	reg = get_reg_ptr(regs, &insn);
+
+	if (write) {
+		memcpy(&val, reg, size);
+		tdg_mmio(size, write, ve->gpa, val);
+	} else {
+		val = tdg_mmio(size, write, ve->gpa, val);
+		memset(reg, 0, size);
+		memcpy(reg, &val, size);
+	}
+	return insn.length;
+}
+
 unsigned long tdg_get_ve_info(struct ve_info *ve)
 {
 	u64 ret;
@@ -258,6 +355,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
 	case EXIT_REASON_IO_INSTRUCTION:
 		tdg_handle_io(regs, ve->exit_qual);
 		break;
+	case EXIT_REASON_EPT_VIOLATION:
+		ve->instr_len = tdg_handle_mmio(regs, ve);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-05-13  2:56   ` Dan Williams
@ 2021-05-18  0:54     ` Kuppuswamy Sathyanarayanan
  2021-05-18  2:06       ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18  0:54 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson,
	Kai Huang, Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a trampoline for booting APs in 64-bit mode via a software handoff
with BIOS, and use the new trampoline for the ACPI MP wake protocol used
by TDX. You can find MADT MP wake protocol details in ACPI specification
r6.4, sec 5.2.12.19.

Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
mode.  For the GDT pointer, create a new entry as the existing storage
for the pointer occupies the zero entry in the GDT itself.

Reported-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
 * Removed X86_CR0_NE and EFER related changes from this patch
   and moved them to the patch titled "x86/boot: Avoid #VE during
   boot for TDX platforms"
 * Fixed commit log as per Dan's suggestion.
 * Added inline get_trampoline_start_ip() to set start_ip.
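
Illustration only, not part of the patch: the commit log says the IDT
pointer grows by four bytes for LIDT in 64-bit mode. That follows from
the descriptor-table pointer layout, a 16-bit limit followed by a base
that is 32 bits wide outside 64-bit mode and 64 bits wide inside it.
A sketch of the two layouts (the struct names are hypothetical, and
__attribute__((packed)) assumes GCC or Clang):

```c
#include <assert.h>
#include <stdint.h>

/* Operand of LIDT/LGDT outside 64-bit mode: 16-bit limit, 32-bit base. */
struct desc_ptr32 {
	uint16_t limit;
	uint32_t base;
} __attribute__((packed));

/* Operand of LIDT/LGDT in 64-bit mode: 16-bit limit, 64-bit base. */
struct desc_ptr64 {
	uint16_t limit;
	uint64_t base;
} __attribute__((packed));

_Static_assert(sizeof(struct desc_ptr32) == 6, "legacy pointer is 6 bytes");
_Static_assert(sizeof(struct desc_ptr64) == 10, "64-bit pointer is 10 bytes");
```

This matches the diff: tr_idt goes from a 6-byte fill to .short plus
.quad, and tr_gdt64 is laid out as .short, .long, .long.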

 arch/x86/boot/compressed/pgtable.h       |  2 +-
 arch/x86/include/asm/realmode.h          | 10 +++++++
 arch/x86/kernel/smpboot.c                |  2 +-
 arch/x86/realmode/rm/header.S            |  1 +
 arch/x86/realmode/rm/trampoline_64.S     | 38 ++++++++++++++++++++++++
 arch/x86/realmode/rm/trampoline_common.S |  7 ++++-
 6 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE	0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
 
 #define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
 
diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 5db5d083c873..3328c8edb200 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
 	u32	sev_es_trampoline_start;
 #endif
 #ifdef CONFIG_X86_64
+	u32	trampoline_start64;
 	u32	trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
@@ -88,6 +89,15 @@ static inline void set_real_mode_mem(phys_addr_t mem)
 	real_mode_header = (struct real_mode_header *) __va(mem);
 }
 
+static inline unsigned long get_trampoline_start_ip(void)
+{
+#ifdef CONFIG_X86_64
+	if (is_tdx_guest())
+		return real_mode_header->trampoline_start64;
+#endif
+	return real_mode_header->trampoline_start;
+}
+
 void reserve_real_mode(void);
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 16703c35a944..0b4dff5e67a9 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1031,7 +1031,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
 		       int *cpu0_nmi_registered)
 {
 	/* start_ip had better be page-aligned! */
-	unsigned long start_ip = real_mode_header->trampoline_start;
+	unsigned long start_ip = get_trampoline_start_ip();
 
 	unsigned long boot_error = 0;
 	unsigned long timeout;
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
 	.long	pa_sev_es_trampoline_start
 #endif
 #ifdef CONFIG_X86_64
+	.long	pa_trampoline_start64
 	.long	pa_trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 84c5d1b33d10..754f8d2ac9e8 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
 	ljmpl	$__KERNEL_CS, $pa_startup_64
 SYM_CODE_END(startup_32)
 
+SYM_CODE_START(pa_trampoline_compat)
+	/*
+	 * In compatibility mode.  Prep ESP and DX for startup_32, then disable
+	 * paging and complete the switch to legacy 32-bit mode.
+	 */
+	movl	$rm_stack_end, %esp
+	movw	$__KERNEL_DS, %dx
+
+	movl	$(X86_CR0_NE | X86_CR0_PE), %eax
+	movl	%eax, %cr0
+	ljmpl   $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
 	.section ".text64","ax"
 	.code64
 	.balign 4
@@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
 	jmpq	*tr_start(%rip)
 SYM_CODE_END(startup_64)
 
+SYM_CODE_START(trampoline_start64)
+	/*
+	 * APs start here on a direct transfer from 64-bit BIOS with identity
+	 * mapped page tables.  Load the kernel's GDT in order to gear down to
+	 * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+	 * segment registers.  Load the zero IDT so any fault triggers a
+	 * shutdown instead of jumping back into BIOS.
+	 */
+	lidt	tr_idt(%rip)
+	lgdt	tr_gdt64(%rip)
+
+	ljmpl	*tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
 	.section ".rodata","a"
 	# Duplicate the global descriptor table
 	# so the kernel can live anywhere
@@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
 	.quad	0x00cf93000000ffff	# __KERNEL_DS
 SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
 
+SYM_DATA_START(tr_gdt64)
+	.short	tr_gdt_end - tr_gdt - 1	# gdt limit
+	.long	pa_tr_gdt
+	.long	0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+	.long	pa_trampoline_compat
+	.short	__KERNEL32_CS
+SYM_DATA_END(tr_compat)
+
 	.bss
 	.balign	PAGE_SIZE
 SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 5033e640f957..ade7db208e4e 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -1,4 +1,9 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 	.section ".rodata","a"
 	.balign	16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+
+/* .fill cannot be used for size > 8. So use short and quad */
+SYM_DATA_START_LOCAL(tr_idt)
+	.short  0
+	.quad   0
+SYM_DATA_END(tr_idt)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/boot: Avoid #VE during boot for TDX platforms
  2021-05-13  3:23   ` Dan Williams
@ 2021-05-18  0:59     ` Kuppuswamy Sathyanarayanan
  2021-05-19 16:53       ` [RFC " Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18  0:59 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson,
	Kuppuswamy Sathyanarayanan

From: Sean Christopherson <sean.j.christopherson@intel.com>

Avoid operations which will inject #VE during the boot process,
which is obviously fatal for TDX platforms.

Details:

1. The TDX module injects #VE if a TDX guest attempts to write
   EFER.

   Boot code updates EFER in the following cases:

   * When enabling Long Mode configuration, the EFER.LME bit will
     be set. Since TDX forces EFER.LME=1, we can skip updating
     it again. Check for EFER.LME before updating it and skip
     the write if it is already set.

   * EFER is also updated to enable support for features like
     System call and No Execute page setting. In TDX, these
     features are set up by the TDX module. So check whether
     they are already enabled, and skip enabling them again.

2. The TDX module also injects a #VE if the guest attempts to clear
   CR0.NE. Ensure CR0.NE is set when loading CR0 during compressed
   boot. Setting CR0.NE should be a nop on all CPUs that
   support 64-bit mode.

3. The TDX module (effectively part of the hypervisor) requires
   CR4.MCE to be set at all times and injects a #VE if the guest
   attempts to clear CR4.MCE. So, preserve CR4.MCE instead of
   clearing it during boot to avoid #VE.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
 * Merged Avoid #VE related changes together.
   * [RFC v2 22/32] x86/boot: Avoid #VE during compressed boot
     for TDX platforms
   * [RFC v2 23/32] x86/boot: Avoid unnecessary #VE during boot process.
 * Fixed commit log as per review comments.

 arch/x86/boot/compressed/head_64.S   | 10 +++++++---
 arch/x86/kernel/head_64.S            | 13 +++++++++++--
 arch/x86/realmode/rm/trampoline_64.S | 11 +++++++++--
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e94874f4bbc1..2d79e5f97360 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -616,12 +616,16 @@ SYM_CODE_START(trampoline_32bit_src)
 	movl	$MSR_EFER, %ecx
 	rdmsr
 	btsl	$_EFER_LME, %eax
+	jc	1f
 	wrmsr
-	popl	%edx
+1:	popl	%edx
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
-	movl	$X86_CR4_PAE, %eax
+	movl	%cr4, %eax
+	/* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
+	andl	$X86_CR4_MCE, %eax
+	orl	$X86_CR4_PAE, %eax
 	testl	%edx, %edx
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
@@ -636,7 +640,7 @@ SYM_CODE_START(trampoline_32bit_src)
 	pushl	%eax
 
 	/* Enable paging again */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
+	movl	$(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	lret
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..92c77cf75542 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 1:
 
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	movq	%cr4, %rcx
+	/* Clearing CR4.MCE will #VE on TDX guests.  Leave it alone. */
+	andl	$X86_CR4_MCE, %ecx
+	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
@@ -229,13 +232,19 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	/* Setup EFER (Extended Feature Enable Register) */
 	movl	$MSR_EFER, %ecx
 	rdmsr
+	movl    %eax, %edx
 	btsl	$_EFER_SCE, %eax	/* Enable System Call */
 	btl	$20,%edi		/* No Execute supported? */
 	jnc     1f
 	btsl	$_EFER_NX, %eax
 	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1:	wrmsr				/* Make changes effective */
 
+	/* Skip the WRMSR if the current value matches the desired value. */
+1:	cmpl	%edx, %eax
+	je	1f
+	xor	%edx, %edx
+	wrmsr				/* Make changes effective */
+1:
 	/* Setup cr0 */
 	movl	$CR0_STATE, %eax
 	/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 754f8d2ac9e8..12b734b1da8b 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,20 @@ SYM_CODE_START(startup_32)
 	movl	%eax, %cr3
 
 	# Set up EFER
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	cmp	pa_tr_efer, %eax
+	jne	.Lwrite_efer
+	cmp	pa_tr_efer + 4, %edx
+	je	.Ldone_efer
+.Lwrite_efer:
 	movl	pa_tr_efer, %eax
 	movl	pa_tr_efer + 4, %edx
-	movl	$MSR_EFER, %ecx
 	wrmsr
 
+.Ldone_efer:
 	# Enable paging and in turn activate Long Mode
-	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	/*
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix 1/1] x86/tdx: Make DMA pages shared
  2021-04-26 18:01 ` [RFC v2 30/32] x86/tdx: Make DMA pages shared Kuppuswamy Sathyanarayanan
@ 2021-05-18  1:19   ` Kuppuswamy Sathyanarayanan
  2021-05-18 19:55     ` Sean Christopherson
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18  1:19 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kai Huang,
	Sean Christopherson, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Intel TDX doesn't allow the VMM to access guest memory. Any memory
that is required for communication with the VMM must be shared
explicitly by setting the shared bit in the page table entry. After
setting the shared bit, the conversion must be completed with the
MapGPA TDVMCALL. The call informs the VMM about the conversion and
makes it remove the GPA from the S-EPT mapping. Shared memory is
similar to unencrypted memory in AMD SME/SEV terminology, but the
underlying process of sharing/un-sharing the memory is different
for the Intel TDX guest platform.

SEV assumes that I/O devices can only do DMA to "decrypted"
physical addresses without the C-bit set.  In order for the CPU
to interact with this memory, the CPU needs a decrypted mapping.
To add this support, AMD SME code makes force_dma_unencrypted()
return true on platforms that support the AMD SEV feature. The DMA
memory allocation API uses it to trigger set_memory_decrypted()
on those platforms.

TDX is similar.  TDX architecturally prevents access to private
guest memory by anything other than the guest itself. This means
that any DMA buffers must be shared.

So create a new file, mem_encrypt_tdx.c, to hold TDX-specific memory
initialization code, and define force_dma_unencrypted() for TDX
guests so that it returns true and DMA pages get mapped as shared.

__set_memory_enc_dec() (renamed to __set_memory_protect() here) is
now aware of TDX and sets the Shared bit accordingly, followed by
the relevant TDVMCALL.

Also, do TDACCEPTPAGE on every 4k page after mapping the GPA range when
converting memory to private.  If the VMM uses a common pool for private
and shared memory, it will likely do TDAUGPAGE in response to MAP_GPA
(or on the first access to the private GPA), in which case TDX-Module will
hold the page in a non-present "pending" state until it is explicitly
accepted.

BUG() if TDACCEPTPAGE fails (unless the page was already accepted), as
the guest is completely hosed if it can't access memory.
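
As a rough illustration of the conversion flow described above, here is
a user-space sketch in C. It is not the kernel code: every sketch_*
name is hypothetical, and the shared-bit position (bit 51) is an
assumption chosen for the example, since the real mask comes from
tdg_shared_mask(). Only the "already accepted" status value is taken
from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed shared-bit position, for illustration only. */
#define SKETCH_SHARED_BIT            ((uint64_t)1 << 51)
/* Status value defined by this patch as TDX_PAGE_ALREADY_ACCEPTED. */
#define SKETCH_PAGE_ALREADY_ACCEPTED 0x8000000000000001ULL

enum sketch_map_type { SKETCH_MAP_PRIVATE, SKETCH_MAP_SHARED };

struct sketch_masks {
	uint64_t set;	/* bits to set in each page table entry */
	uint64_t clr;	/* bits to clear in each page table entry */
};

/* protect == true converts to private, false converts to shared. */
struct sketch_masks sketch_pte_masks(bool protect)
{
	struct sketch_masks m;

	m.set = protect ? 0 : SKETCH_SHARED_BIT;
	m.clr = protect ? SKETCH_SHARED_BIT : 0;
	return m;
}

/* Apply the masks to one page table entry value. */
uint64_t sketch_apply(uint64_t pte, struct sketch_masks m)
{
	return (pte & ~m.clr) | m.set;
}

/* Pages are accepted one by one only when converting to private. */
int sketch_pages_to_accept(int numpages, enum sketch_map_type type)
{
	return type == SKETCH_MAP_PRIVATE ? numpages : 0;
}

/* An accept call is fine if it succeeded or the page was already accepted. */
bool sketch_accept_ok(uint64_t status)
{
	return status == 0 || status == SKETCH_PAGE_ALREADY_ACCEPTED;
}
```

The point of the sketch is the asymmetry: shared conversions only flip
the bit and notify the VMM, while private conversions additionally
walk the range and accept each 4k page.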

Tested-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2:
 * Since the common code between AMD-SEV and TDX is very minimal,
   defining a new config (X86_MEM_ENCRYPT_COMMON) for common code
   is not very useful. So created a separate file for Intel TDX
   specific memory initialization (similar to AMD SEV).
 * Removed patch titled "x86/mm: Move force_dma_unencrypted() to
   common code" from this series. And merged required changes in
   this patch.

 arch/x86/Kconfig              |  1 +
 arch/x86/include/asm/tdx.h    |  3 +++
 arch/x86/kernel/tdx.c         | 26 ++++++++++++++++++-
 arch/x86/mm/Makefile          |  1 +
 arch/x86/mm/mem_encrypt_tdx.c | 19 ++++++++++++++
 arch/x86/mm/pat/set_memory.c  | 48 +++++++++++++++++++++++++++++------
 6 files changed, 89 insertions(+), 9 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_tdx.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a055594e2664..69a98bcdc07a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -879,6 +879,7 @@ config INTEL_TDX_GUEST
 	select X86_X2APIC
 	select SECURITY_LOCKDOWN_LSM
 	select ARCH_HAS_PROTECTED_GUEST
+	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
 	select DYNAMIC_PHYSICAL_MASK
 	help
 	  Provide support for running in a trusted domain on Intel processors
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index f5e8088dabc5..4ad436cc2146 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -19,6 +19,9 @@ enum tdx_map_type {
 
 #define TDINFO			1
 #define TDGETVEINFO		3
+#define TDACCEPTPAGE		6
+
+#define TDX_PAGE_ALREADY_ACCEPTED	0x8000000000000001
 
 struct tdx_module_output {
 	u64 rcx;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 9ddb80adc034..caf8e4c5ddbc 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -100,7 +100,8 @@ static void tdg_get_info(void)
 	physical_mask &= ~tdg_shared_mask();
 }
 
-int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+static int __tdg_map_gpa(phys_addr_t gpa, int numpages,
+			 enum tdx_map_type map_type)
 {
 	u64 ret;
 
@@ -111,6 +112,29 @@ int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
 	return ret ? -EIO : 0;
 }
 
+static void tdg_accept_page(phys_addr_t gpa)
+{
+	u64 ret;
+
+	ret = __tdx_module_call(TDACCEPTPAGE, gpa, 0, 0, 0, NULL);
+
+	BUG_ON(ret && ret != TDX_PAGE_ALREADY_ACCEPTED);
+}
+
+int tdg_map_gpa(phys_addr_t gpa, int numpages, enum tdx_map_type map_type)
+{
+	int ret, i;
+
+	ret = __tdg_map_gpa(gpa, numpages, map_type);
+	if (ret || map_type == TDX_MAP_SHARED)
+		return ret;
+
+	for (i = 0; i < numpages; i++)
+		tdg_accept_page(gpa + i*PAGE_SIZE);
+
+	return 0;
+}
+
 static __cpuidle void tdg_halt(void)
 {
 	u64 ret;
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..555dcc0cd087 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -55,3 +55,4 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= mem_encrypt_tdx.o
diff --git a/arch/x86/mm/mem_encrypt_tdx.c b/arch/x86/mm/mem_encrypt_tdx.c
new file mode 100644
index 000000000000..f394a43bf46d
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_tdx.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Intel TDX Memory Encryption Support
+ *
+ * Copyright (C) 2020 Intel Corporation
+ *
+ * Author: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/dma-mapping.h>
+
+#include <asm/tdx.h>
+
+/* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
+bool force_dma_unencrypted(struct device *dev)
+{
+	return is_tdx_guest();
+}
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..ea78c7907847 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -27,6 +27,7 @@
 #include <asm/proto.h>
 #include <asm/memtype.h>
 #include <asm/set_memory.h>
+#include <asm/tdx.h>
 
 #include "../mm_internal.h"
 
@@ -1972,13 +1973,15 @@ int set_memory_global(unsigned long addr, int numpages)
 				    __pgprot(_PAGE_GLOBAL), 0);
 }
 
-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_protect(unsigned long addr, int numpages, bool protect)
 {
+	pgprot_t mem_protected_bits, mem_plain_bits;
 	struct cpa_data cpa;
+	enum tdx_map_type map_type;
 	int ret;
 
-	/* Nothing to do if memory encryption is not active */
-	if (!mem_encrypt_active())
+	/* Nothing to do if memory encryption and TDX are not active */
+	if (!mem_encrypt_active() && !is_tdx_guest())
 		return 0;
 
 	/* Should not be working on unaligned addresses */
@@ -1988,8 +1991,25 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	memset(&cpa, 0, sizeof(cpa));
 	cpa.vaddr = &addr;
 	cpa.numpages = numpages;
-	cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
-	cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+	if (is_tdx_guest()) {
+		mem_protected_bits = __pgprot(0);
+		mem_plain_bits = __pgprot(tdg_shared_mask());
+	} else {
+		mem_protected_bits = __pgprot(_PAGE_ENC);
+		mem_plain_bits = __pgprot(0);
+	}
+
+	if (protect) {
+		cpa.mask_set = mem_protected_bits;
+		cpa.mask_clr = mem_plain_bits;
+		map_type = TDX_MAP_PRIVATE;
+	} else {
+		cpa.mask_set = mem_plain_bits;
+		cpa.mask_clr = mem_protected_bits;
+		map_type = TDX_MAP_SHARED;
+	}
+
 	cpa.pgd = init_mm.pgd;
 
 	/* Must avoid aliasing mappings in the highmem code */
@@ -1998,8 +2018,16 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 
 	/*
 	 * Before changing the encryption attribute, we need to flush caches.
+	 *
+	 * For TDX we need to flush caches on private->shared. VMM is
+	 * responsible for flushing on shared->private.
 	 */
-	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	if (is_tdx_guest()) {
+		if (map_type == TDX_MAP_SHARED)
+			cpa_flush(&cpa, 1);
+	} else {
+		cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	}
 
 	ret = __change_page_attr_set_clr(&cpa, 1);
 
@@ -2012,18 +2040,22 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	 */
 	cpa_flush(&cpa, 0);
 
+	if (!ret && is_tdx_guest()) {
+		ret = tdg_map_gpa(__pa(addr), numpages, map_type);
+	}
+
 	return ret;
 }
 
 int set_memory_encrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, true);
+	return __set_memory_protect(addr, numpages, true);
 }
 EXPORT_SYMBOL_GPL(set_memory_encrypted);
 
 int set_memory_decrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, false);
+	return __set_memory_protect(addr, numpages, false);
 }
 EXPORT_SYMBOL_GPL(set_memory_decrypted);
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2 26/32] x86/mm: Move force_dma_unencrypted() to common code
  2021-05-12 15:44           ` Dave Hansen
  2021-05-12 15:53             ` Sean Christopherson
@ 2021-05-18  1:28             ` Kuppuswamy, Sathyanarayanan
  2021-05-27  4:46               ` Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-18  1:28 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov
  Cc: Peter Zijlstra, Andy Lutomirski, Dan Williams, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel



On 5/12/21 8:44 AM, Dave Hansen wrote:
> Because the code is already separate.  You're actually going to some
> trouble to move the SEV-specific code and then combine it with the
> TDX-specific code.
> 
> Anyway, please just give it a shot.  Should take all of ten minutes.  If
> it doesn't work out in practice, fine.  You'll have a good paragraph for
> the changelog.

After reviewing the code again, I have noticed that we don't really have
much common code between AMD and TDX. So I don't see any justification for
creating this common layer. So, I have decided to drop this patch and move
Intel TDX specific memory encryption init code to patch titled "[RFC v2 30/32]
x86/tdx: Make DMA pages shared". This model is similar to how AMD-SEV
does the initialization.

I have sent the modified patch as a reply to the patch titled "[RFC v2 30/32]
x86/tdx: Make DMA pages shared". Please check and let me know your comments.
-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-05-18  0:54     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-05-18  2:06       ` Dan Williams
  2021-05-18  2:53         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-18  2:06 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Sean Christopherson, Kai Huang

I notice that you have [RFC v2-fix 1/1] as the prefix for this patch.
b4 recently gained support for partial series re-rolls [1], but I
think you would need to bump the version number [RFC PATCH v3 21/32]
and maintain the patch numbering. In this case with changes moving
between patches, and those other patches being squashed any chance of
automated reconstruction of this series is likely lost.

Just wanted to note that for future reference in case you were hoping
to avoid resending full series in the future. For now, some more
comments below:

[1]: https://lore.kernel.org/tools/20210517161317.teawoh5qovxpmqdc@nitro.local/

On Mon, May 17, 2021 at 5:54 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Add a trampoline for booting APs in 64-bit mode via a software handoff
> with BIOS, and use the new trampoline for the ACPI MP wake protocol used
> by TDX. You can find MADT MP wake protocol details in ACPI specification
> r6.4, sec 5.2.12.19.
>
> Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
> mode.  For the GDT pointer, create a new entry as the existing storage
> for the pointer occupies the zero entry in the GDT itself.
>
> Reported-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>
> Changes since RFC v2:
>  * Removed X86_CR0_NE and EFER related changes from this changes

This was only partially done, see below...

>    and moved it to patch titled "x86/boot: Avoid #VE during
>    boot for TDX platforms"
>  * Fixed commit log as per Dan's suggestion.
>  * Added inline get_trampoline_start_ip() to set start_ip.

You also added a comment to tr_idt, but didn't mention it here, so I
went to double check. Please take care to document all changes to the
patch from the previous review.

>
>  arch/x86/boot/compressed/pgtable.h       |  2 +-
>  arch/x86/include/asm/realmode.h          | 10 +++++++
>  arch/x86/kernel/smpboot.c                |  2 +-
>  arch/x86/realmode/rm/header.S            |  1 +
>  arch/x86/realmode/rm/trampoline_64.S     | 38 ++++++++++++++++++++++++
>  arch/x86/realmode/rm/trampoline_common.S |  7 ++++-
>  6 files changed, 57 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
> index 6ff7e81b5628..cc9b2529a086 100644
> --- a/arch/x86/boot/compressed/pgtable.h
> +++ b/arch/x86/boot/compressed/pgtable.h
> @@ -6,7 +6,7 @@
>  #define TRAMPOLINE_32BIT_PGTABLE_OFFSET        0
>
>  #define TRAMPOLINE_32BIT_CODE_OFFSET   PAGE_SIZE
> -#define TRAMPOLINE_32BIT_CODE_SIZE     0x70
> +#define TRAMPOLINE_32BIT_CODE_SIZE     0x80
>
>  #define TRAMPOLINE_32BIT_STACK_END     TRAMPOLINE_32BIT_SIZE
>
> diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
> index 5db5d083c873..3328c8edb200 100644
> --- a/arch/x86/include/asm/realmode.h
> +++ b/arch/x86/include/asm/realmode.h
> @@ -25,6 +25,7 @@ struct real_mode_header {
>         u32     sev_es_trampoline_start;
>  #endif
>  #ifdef CONFIG_X86_64
> +       u32     trampoline_start64;
>         u32     trampoline_pgd;
>  #endif
>         /* ACPI S3 wakeup */
> @@ -88,6 +89,15 @@ static inline void set_real_mode_mem(phys_addr_t mem)
>         real_mode_header = (struct real_mode_header *) __va(mem);
>  }
>
> +static inline unsigned long get_trampoline_start_ip(void)

I'd prefer this helper take a 'struct real_mode_header *rmh' as an
argument rather than assume a global variable.

> +{
> +#ifdef CONFIG_X86_64
> +        if (is_tdx_guest())
> +                return real_mode_header->trampoline_start64;
> +#endif
> +       return real_mode_header->trampoline_start;
> +}
> +
>  void reserve_real_mode(void);
>
>  #endif /* __ASSEMBLY__ */
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 16703c35a944..0b4dff5e67a9 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1031,7 +1031,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
>                        int *cpu0_nmi_registered)
>  {
>         /* start_ip had better be page-aligned! */
> -       unsigned long start_ip = real_mode_header->trampoline_start;
> +       unsigned long start_ip = get_trampoline_start_ip();
>
>         unsigned long boot_error = 0;
>         unsigned long timeout;
> diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
> index 8c1db5bf5d78..2eb62be6d256 100644
> --- a/arch/x86/realmode/rm/header.S
> +++ b/arch/x86/realmode/rm/header.S
> @@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
>         .long   pa_sev_es_trampoline_start
>  #endif
>  #ifdef CONFIG_X86_64
> +       .long   pa_trampoline_start64
>         .long   pa_trampoline_pgd;
>  #endif
>         /* ACPI S3 wakeup */
> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
> index 84c5d1b33d10..754f8d2ac9e8 100644
> --- a/arch/x86/realmode/rm/trampoline_64.S
> +++ b/arch/x86/realmode/rm/trampoline_64.S
> @@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
>         ljmpl   $__KERNEL_CS, $pa_startup_64
>  SYM_CODE_END(startup_32)
>
> +SYM_CODE_START(pa_trampoline_compat)
> +       /*
> +        * In compatibility mode.  Prep ESP and DX for startup_32, then disable
> +        * paging and complete the switch to legacy 32-bit mode.
> +        */
> +       movl    $rm_stack_end, %esp
> +       movw    $__KERNEL_DS, %dx
> +
> +       movl    $(X86_CR0_NE | X86_CR0_PE), %eax

Before this patch the startup path did not touch X86_CR0_NE. I assume
it was added opportunistically for the TDX case? If it is to stay in
this patch it deserves a code comment / mention in the changelog, or
it needs to move to the other patch that fixes up the CR0 setup for
TDX.


> +       movl    %eax, %cr0
> +       ljmpl   $__KERNEL32_CS, $pa_startup_32
> +SYM_CODE_END(pa_trampoline_compat)
> +
>         .section ".text64","ax"
>         .code64
>         .balign 4
> @@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
>         jmpq    *tr_start(%rip)
>  SYM_CODE_END(startup_64)
>
> +SYM_CODE_START(trampoline_start64)
> +       /*
> +        * APs start here on a direct transfer from 64-bit BIOS with identity
> +        * mapped page tables.  Load the kernel's GDT in order to gear down to
> +        * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
> +        * segment registers.  Load the zero IDT so any fault triggers a
> +        * shutdown instead of jumping back into BIOS.
> +        */
> +       lidt    tr_idt(%rip)
> +       lgdt    tr_gdt64(%rip)
> +
> +       ljmpl   *tr_compat(%rip)
> +SYM_CODE_END(trampoline_start64)
> +
>         .section ".rodata","a"
>         # Duplicate the global descriptor table
>         # so the kernel can live anywhere
> @@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
>         .quad   0x00cf93000000ffff      # __KERNEL_DS
>  SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
>
> +SYM_DATA_START(tr_gdt64)
> +       .short  tr_gdt_end - tr_gdt - 1 # gdt limit
> +       .long   pa_tr_gdt
> +       .long   0
> +SYM_DATA_END(tr_gdt64)
> +
> +SYM_DATA_START(tr_compat)
> +       .long   pa_trampoline_compat
> +       .short  __KERNEL32_CS
> +SYM_DATA_END(tr_compat)
> +
>         .bss
>         .balign PAGE_SIZE
>  SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
> diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
> index 5033e640f957..ade7db208e4e 100644
> --- a/arch/x86/realmode/rm/trampoline_common.S
> +++ b/arch/x86/realmode/rm/trampoline_common.S
> @@ -1,4 +1,9 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>         .section ".rodata","a"
>         .balign 16
> -SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
> +
> +/* .fill cannot be used for size > 8. So use short and quad */

If there is to be a comment here it should be to clarify why @tr_idt
is 10 bytes, not necessarily a quirk of the assembler.

> +SYM_DATA_START_LOCAL(tr_idt)

The .fill restriction is only for @size, not @repeat. So, what's wrong
with SYM_DATA_LOCAL(tr_idt, .fill 2, 5, 0)?
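
For reference, the size question can be checked with a small C sketch.
It assumes a GCC/Clang-style packed attribute, and the struct and
helper names are hypothetical. It shows why the descriptor-table
pointer grows from 6 to 10 bytes in 64-bit mode (2-byte limit plus an
8-byte base instead of a 4-byte one), and that a `.fill repeat, size`
directive emits repeat * size bytes, with the assembler's limit of 8
applying only to the size operand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Operand of LIDT/LGDT outside 64-bit mode: 16-bit limit, 32-bit base. */
struct sketch_desc_ptr32 {
	uint16_t limit;
	uint32_t base;
} __attribute__((packed));

/* Operand of LIDT/LGDT in 64-bit mode: 16-bit limit, 64-bit base. */
struct sketch_desc_ptr64 {
	uint16_t limit;
	uint64_t base;
} __attribute__((packed));

/* Bytes emitted by ".fill repeat, size, value" when size <= 8. */
size_t sketch_fill_bytes(size_t repeat, size_t size)
{
	return repeat * size;
}
```

Under these assumptions, `.fill 1, 6, 0` matches the old 6-byte
pointer, and both `.fill 2, 5, 0` and a `.short` followed by a `.quad`
produce the 10 zero bytes needed for the 64-bit form.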

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-05-18  2:06       ` Dan Williams
@ 2021-05-18  2:53         ` Kuppuswamy, Sathyanarayanan
  2021-05-18  4:08           ` Dan Williams
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-18  2:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Sean Christopherson, Kai Huang



On 5/17/21 7:06 PM, Dan Williams wrote:
> I notice that you have [RFC v2-fix 1/1] as the prefix for this patch.
> b4 recently gained support for partial series re-rolls [1], but I
> think you would need to bump the version number [RFC PATCH v3 21/32]
> and maintain the patch numbering. In this case with changes moving
> between patches, and those other patches being squashed any chance of
> automated reconstruction of this series is likely lost.

Ok. I will make sure to bump the version in next partial re-roll.

If I am fixing this patch as per your comments, do I need to bump the
patch version for it as well?

> 
> Just wanted to note that for future reference in case you were hoping
> to avoid resending full series in the future. For now, some more
> comments below:

Thanks.

> 
> [1]: https://lore.kernel.org/tools/20210517161317.teawoh5qovxpmqdc@nitro.local/
> 
> On Mon, May 17, 2021 at 5:54 PM Kuppuswamy Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>>
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> Add a trampoline for booting APs in 64-bit mode via a software handoff
>> with BIOS, and use the new trampoline for the ACPI MP wake protocol used
>> by TDX. You can find MADT MP wake protocol details in ACPI specification
>> r6.4, sec 5.2.12.19.
>>
>> Extend the real mode IDT pointer by four bytes to support LIDT in 64-bit
>> mode.  For the GDT pointer, create a new entry as the existing storage
>> for the pointer occupies the zero entry in the GDT itself.
>>
>> Reported-by: Kai Huang <kai.huang@intel.com>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> ---
>>
>> Changes since RFC v2:
>>   * Removed X86_CR0_NE and EFER related changes from this changes
> 
> This was only partially done, see below...
> 
>>     and moved it to patch titled "x86/boot: Avoid #VE during
>>     boot for TDX platforms"
>>   * Fixed commit log as per Dan's suggestion.
>>   * Added inline get_trampoline_start_ip() to set start_ip.
> 
> You also added a comment to tr_idt, but didn't mention it here, so I
> went to double check. Please take care to document all changes to the
> patch from the previous review.

Ok. I will make sure the change log is current.

> 
>>
>>   arch/x86/boot/compressed/pgtable.h       |  2 +-
>>   arch/x86/include/asm/realmode.h          | 10 +++++++
>>   arch/x86/kernel/smpboot.c                |  2 +-
>>   arch/x86/realmode/rm/header.S            |  1 +
>>   arch/x86/realmode/rm/trampoline_64.S     | 38 ++++++++++++++++++++++++
>>   arch/x86/realmode/rm/trampoline_common.S |  7 ++++-
>>   6 files changed, 57 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
>> index 6ff7e81b5628..cc9b2529a086 100644
>> --- a/arch/x86/boot/compressed/pgtable.h
>> +++ b/arch/x86/boot/compressed/pgtable.h
>> @@ -6,7 +6,7 @@
>>   #define TRAMPOLINE_32BIT_PGTABLE_OFFSET        0
>>
>>   #define TRAMPOLINE_32BIT_CODE_OFFSET   PAGE_SIZE
>> -#define TRAMPOLINE_32BIT_CODE_SIZE     0x70
>> +#define TRAMPOLINE_32BIT_CODE_SIZE     0x80
>>
>>   #define TRAMPOLINE_32BIT_STACK_END     TRAMPOLINE_32BIT_SIZE
>>
>> diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
>> index 5db5d083c873..3328c8edb200 100644
>> --- a/arch/x86/include/asm/realmode.h
>> +++ b/arch/x86/include/asm/realmode.h
>> @@ -25,6 +25,7 @@ struct real_mode_header {
>>          u32     sev_es_trampoline_start;
>>   #endif
>>   #ifdef CONFIG_X86_64
>> +       u32     trampoline_start64;
>>          u32     trampoline_pgd;
>>   #endif
>>          /* ACPI S3 wakeup */
>> @@ -88,6 +89,15 @@ static inline void set_real_mode_mem(phys_addr_t mem)
>>          real_mode_header = (struct real_mode_header *) __va(mem);
>>   }
>>
>> +static inline unsigned long get_trampoline_start_ip(void)
> 
> I'd prefer this helper take a 'struct real_mode_header *rmh' as an
> argument rather than assume a global variable.

I am fine with it. But the existing inline functions also directly read/write
real_mode_header. So I just followed the same format.

I will fix this in next version.

> 
>> +{
>> +#ifdef CONFIG_X86_64
>> +        if (is_tdx_guest())
>> +                return real_mode_header->trampoline_start64;
>> +#endif
>> +       return real_mode_header->trampoline_start;
>> +}
>> +
>>   void reserve_real_mode(void);
>>
>>   #endif /* __ASSEMBLY__ */
>> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
>> index 16703c35a944..0b4dff5e67a9 100644
>> --- a/arch/x86/kernel/smpboot.c
>> +++ b/arch/x86/kernel/smpboot.c
>> @@ -1031,7 +1031,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
>>                         int *cpu0_nmi_registered)
>>   {
>>          /* start_ip had better be page-aligned! */
>> -       unsigned long start_ip = real_mode_header->trampoline_start;
>> +       unsigned long start_ip = get_trampoline_start_ip();
>>
>>          unsigned long boot_error = 0;
>>          unsigned long timeout;
>> diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
>> index 8c1db5bf5d78..2eb62be6d256 100644
>> --- a/arch/x86/realmode/rm/header.S
>> +++ b/arch/x86/realmode/rm/header.S
>> @@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
>>          .long   pa_sev_es_trampoline_start
>>   #endif
>>   #ifdef CONFIG_X86_64
>> +       .long   pa_trampoline_start64
>>          .long   pa_trampoline_pgd;
>>   #endif
>>          /* ACPI S3 wakeup */
>> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
>> index 84c5d1b33d10..754f8d2ac9e8 100644
>> --- a/arch/x86/realmode/rm/trampoline_64.S
>> +++ b/arch/x86/realmode/rm/trampoline_64.S
>> @@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
>>          ljmpl   $__KERNEL_CS, $pa_startup_64
>>   SYM_CODE_END(startup_32)
>>
>> +SYM_CODE_START(pa_trampoline_compat)
>> +       /*
>> +        * In compatibility mode.  Prep ESP and DX for startup_32, then disable
>> +        * paging and complete the switch to legacy 32-bit mode.
>> +        */
>> +       movl    $rm_stack_end, %esp
>> +       movw    $__KERNEL_DS, %dx
>> +
>> +       movl    $(X86_CR0_NE | X86_CR0_PE), %eax
> 
> Before this patch the startup path did not touch X86_CR0_NE. I assume
> it was added opportunistically for the TDX case? If it is to stay in
> this patch it deserves a code comment / mention in the changelog, or
> it needs to move to the other patch that fixes up the CR0 setup for
> TDX.

I will move X86_CR0_NE related update to the patch that has other
X86_CR0_NE related updates.

> 
> 
>> +       movl    %eax, %cr0
>> +       ljmpl   $__KERNEL32_CS, $pa_startup_32
>> +SYM_CODE_END(pa_trampoline_compat)
>> +
>>          .section ".text64","ax"
>>          .code64
>>          .balign 4
>> @@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
>>          jmpq    *tr_start(%rip)
>>   SYM_CODE_END(startup_64)
>>
>> +SYM_CODE_START(trampoline_start64)
>> +       /*
>> +        * APs start here on a direct transfer from 64-bit BIOS with identity
>> +        * mapped page tables.  Load the kernel's GDT in order to gear down to
>> +        * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
>> +        * segment registers.  Load the zero IDT so any fault triggers a
>> +        * shutdown instead of jumping back into BIOS.
>> +        */
>> +       lidt    tr_idt(%rip)
>> +       lgdt    tr_gdt64(%rip)
>> +
>> +       ljmpl   *tr_compat(%rip)
>> +SYM_CODE_END(trampoline_start64)
>> +
>>          .section ".rodata","a"
>>          # Duplicate the global descriptor table
>>          # so the kernel can live anywhere
>> @@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
>>          .quad   0x00cf93000000ffff      # __KERNEL_DS
>>   SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
>>
>> +SYM_DATA_START(tr_gdt64)
>> +       .short  tr_gdt_end - tr_gdt - 1 # gdt limit
>> +       .long   pa_tr_gdt
>> +       .long   0
>> +SYM_DATA_END(tr_gdt64)
>> +
>> +SYM_DATA_START(tr_compat)
>> +       .long   pa_trampoline_compat
>> +       .short  __KERNEL32_CS
>> +SYM_DATA_END(tr_compat)
>> +
>>          .bss
>>          .balign PAGE_SIZE
>>   SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
>> diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
>> index 5033e640f957..ade7db208e4e 100644
>> --- a/arch/x86/realmode/rm/trampoline_common.S
>> +++ b/arch/x86/realmode/rm/trampoline_common.S
>> @@ -1,4 +1,9 @@
>>   /* SPDX-License-Identifier: GPL-2.0 */
>>          .section ".rodata","a"
>>          .balign 16
>> -SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
>> +
>> +/* .fill cannot be used for size > 8. So use short and quad */
> 
> If there is to be a comment here it should be to clarify why @tr_idt
> is 10 bytes, not necessarily a quirk of the assembler.

Got it. I will fix the comment or remove it.

> 
>> +SYM_DATA_START_LOCAL(tr_idt)
> 
> The .fill restriction is only for @size, not @repeat. So, what's wrong
> with SYM_DATA_LOCAL(tr_idt, .fill 2, 5, 0)?

Any reason to prefer the above change over the previous code?

SYM_DATA_START_LOCAL(tr_idt)
         .short  0
         .quad   0
SYM_DATA_END(tr_idt)

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/boot: Add a trampoline for APs booting in 64-bit mode
  2021-05-18  2:53         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-18  4:08           ` Dan Williams
  2021-05-20  0:18             ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dan Williams @ 2021-05-18  4:08 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, Linux Kernel Mailing List,
	Sean Christopherson, Kai Huang

On Mon, May 17, 2021 at 7:53 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
>
>
> On 5/17/21 7:06 PM, Dan Williams wrote:
> > I notice that you have [RFC v2-fix 1/1] as the prefix for this patch.
> > b4 recently gained support for partial series re-rolls [1], but I
> > think you would need to bump the version number [RFC PATCH v3 21/32]
> > and maintain the patch numbering. In this case with changes moving
> > between patches, and those other patches being squashed any chance of
> > automated reconstruction of this series is likely lost.
>
> Ok. I will make sure to bump the version in next partial re-roll.
>
> If I am fixing this patch as per your comments, do I need bump the
> patch version for it as well?

I don't think it matters too much in this case as I don't think I can
use b4 to assemble this series. So just for future reference on other
patch sets. That said, I wouldn't mind a link to your work-in-progress
branch to see all the changes together in one place.

[..]
> > I'd prefer this helper take a 'struct real_mode_header *rmh' as an
> > argument rather than assume a global variable.
>
> I am fine with it. But existing inline functions also directly read/writes
> the real_mode_header. So I just followed the same format.

I notice the SEV-ES code passes an @rmh variable around for this purpose.

[..]
> > If there is to be a comment here it should be to clarify why @tr_idt
> > is 10 bytes, not necessarily a quirk of the assembler.
>
> Got it. I will fix the comment or remove it.
>
> >
> >> +SYM_DATA_START_LOCAL(tr_idt)
> >
> > The .fill restriction is only for @size, not @repeat. So, what's wrong
> > with SYM_DATA_LOCAL(tr_idt, .fill 2, 5, 0)?
>
> Any reason to prefer above change over previous code ?

What I'm really after is capturing why this size needs to be adjusted
for future reference. Maybe it's plainly obvious to someone who has
worked with this code, but it was not immediately obvious to me.

>
> SYM_DATA_START_LOCAL(tr_idt)
>          .short  0
>          .quad   0
> SYM_DATA_END(tr_idt)

This format implies that tr_idt is reserving space for 2 distinct data
structure attributes of those sizes, can you just put those names here
as comments? Otherwise the .fill format is more compact.
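
For reference, the two fields in question are the operand that lidt reads:
a 16-bit limit followed by the base address, which is 32 bits wide in
32-bit mode (6 bytes total) and 64 bits wide in 64-bit mode (10 bytes
total). A minimal C sketch of that layout follows; the struct names are
hypothetical and are not taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only (not kernel types). */

/* Operand read by lidt in 32-bit mode: 2 + 4 = 6 bytes. */
struct idt_desc32 {
	uint16_t limit;	/* size of the IDT in bytes, minus one */
	uint32_t base;	/* linear base address of the IDT */
} __attribute__((packed));

/* Operand read by lidt in 64-bit mode: 2 + 8 = 10 bytes. */
struct idt_desc64 {
	uint16_t limit;	/* size of the IDT in bytes, minus one */
	uint64_t base;	/* linear base address of the IDT */
} __attribute__((packed));

/* The size difference is why a 6-byte tr_idt is too small for 64-bit use. */
static_assert(sizeof(struct idt_desc32) == 6, "32-bit lidt operand is 6 bytes");
static_assert(sizeof(struct idt_desc64) == 10, "64-bit lidt operand is 10 bytes");
```

With both fields zero, as in tr_idt, the descriptor describes an empty IDT,
which matches the patch comment about any fault triggering a shutdown.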

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18  0:48     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-05-18 15:00       ` Dave Hansen
  2021-05-18 15:56         ` Andi Kleen
  2021-05-18 16:18         ` Sean Christopherson
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 15:00 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> In traditional VMs, MMIO tends to be implemented by giving a
> guest access to a mapping which will cause a VMEXIT on access.
> That's not possible in TDX guest.

Why is it not possible?

> So use #VE to implement MMIO support. In TDX guest, MMIO triggers #VE
> with EPT_VIOLATION exit reason.

What does the #VE handler do to resolve the exception?

> For now we only handle a subset of instructions that the kernel
> uses for MMIO operations. User-space access triggers SIGBUS.

How do you know which instructions the kernel uses?  How do you know
that the compiler won't change them?

I guess the kernel won't boot far if this happens, but this still sounds
like trial-and-error programming.

> Also, reasons for supporting #VE based MMIO in TDX guest are,
> 
> * MMIO is widely used and we'll have more drivers in the future.

OK, but you've also made a big deal about having to go explicitly audit
these drivers.  I would imagine converting these over to stop using MMIO
would be _relatively_ minor compared to a big security audit and new
fuzzing infrastructure.

> * We don't want to annotate every TDX specific MMIO readl/writel etc.

				    ^ TDX-specific

> * If we didn't annotate we would need to add an alternative to every
>   MMIO access in the kernel (even though 99.9% will never be used on
>   TDX) which would be a complete waste and incredible binary bloat
>   for nothing.

That sounds like something objective we can measure.  Does this cost 1
byte of extra text per readl/writel?  10?  100?

You're also being rather indirect about what solutions you ruled out.
Why not just say: we considered doing ____, but ruled that out because
it would have required ____.  Above you just tell us what the solution
required without mentioning the solution.

> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index b9e3010987e0..9330c7a9ad69 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -5,6 +5,8 @@
>  
>  #include <asm/tdx.h>
>  #include <asm/vmx.h>
> +#include <asm/insn.h>
> +#include <linux/sched/signal.h> /* force_sig_fault() */
>  
>  #include <linux/cpu.h>
>  #include <linux/protected_guest.h>
> @@ -209,6 +211,101 @@ static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
>  	}
>  }
>  
> +static unsigned long tdg_mmio(int size, bool write, unsigned long addr,
> +		unsigned long val)
> +{
> +	return tdx_hypercall_out_r11(EXIT_REASON_EPT_VIOLATION, size,
> +				     write, addr, val);
> +}
> +
> +static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn)
> +{
> +	static const int regoff[] = {
> +		offsetof(struct pt_regs, ax),
> +		offsetof(struct pt_regs, cx),
> +		offsetof(struct pt_regs, dx),
> +		offsetof(struct pt_regs, bx),
> +		offsetof(struct pt_regs, sp),
> +		offsetof(struct pt_regs, bp),
> +		offsetof(struct pt_regs, si),
> +		offsetof(struct pt_regs, di),
> +		offsetof(struct pt_regs, r8),
> +		offsetof(struct pt_regs, r9),
> +		offsetof(struct pt_regs, r10),
> +		offsetof(struct pt_regs, r11),
> +		offsetof(struct pt_regs, r12),
> +		offsetof(struct pt_regs, r13),
> +		offsetof(struct pt_regs, r14),
> +		offsetof(struct pt_regs, r15),
> +	};
> +	int regno;
> +
> +	regno = X86_MODRM_REG(insn->modrm.value);
> +	if (X86_REX_R(insn->rex_prefix.value))
> +		regno += 8;
> +
> +	return (void *)regs + regoff[regno];
> +}

Was there a reason you copied and pasted this from get_reg_offset()
instead of refactoring?  This looks like almost entirely a subset of
get_reg_offset().
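
For readers following along, the register selection done by the quoted
helper boils down to the sketch below: take the 3-bit reg field of the
ModRM byte and extend it with REX.R. The function name is hypothetical;
only the bit positions come from the x86 instruction encoding.

```c
#include <stdint.h>

/*
 * Hypothetical helper: index (0-15) of the register named by ModRM.reg,
 * extended by the REX.R bit. Index order is ax, cx, dx, bx, sp, bp, si,
 * di, r8..r15, matching the regoff[] table in the quoted patch.
 */
static inline unsigned int modrm_reg_index(uint8_t modrm, uint8_t rex)
{
	unsigned int regno = (modrm >> 3) & 0x7;	/* ModRM bits 5:3 */

	if (rex & 0x4)					/* REX.R is bit 2 */
		regno += 8;

	return regno;
}
```

Note that for 8-bit operands without a REX prefix, encodings 4 to 7 name
AH, CH, DH and BH rather than the low bytes of sp/bp/si/di. The sketch
does not model that case.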

> +static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
> +	int size;
> +	bool write;
> +	unsigned long *reg;
> +	struct insn insn;
> +	unsigned long val = 0;
> +
> +	/*
> +	 * User mode would mean the kernel exposed a device directly
> +	 * to ring3, which shouldn't happen except for things like
> +	 * DPDK.
> +	 */

Uhh....

	https://www.kernel.org/doc/html/v4.14/driver-api/uio-howto.html

I thought there were more than a few ways that userspace could get
access to MMIO mappings.

Also, do most people know what DPDK is?  Should we even be talking about
silly out-of-tree kernel bypass schemes in kernel comments?

> +	if (user_mode(regs)) {
> +		pr_err("Unexpected user-mode MMIO access.\n");
> +		force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);

						       extra space ^

Is a non-ratelimited pr_err() appropriate here?  I guess there shouldn't
be any MMIO passthrough to userspace on these systems.

> +		return 0;
> +	}
> +
> +	kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
> +	insn_get_length(&insn);
> +	insn_get_opcode(&insn);
> +
> +	write = ve->exit_qual & 0x2;
> +
> +	size = insn.opnd_bytes;
> +	switch (insn.opcode.bytes[0]) {
> +	/* MOV r/m8	r8	*/
> +	case 0x88:
> +	/* MOV r8	r/m8	*/
> +	case 0x8A:
> +	/* MOV r/m8	imm8	*/
> +	case 0xC6:

FWIW, I find that *REALLY* hard to read.

Check out is_string_insn() for a more readable example.

Oh, and I misread that.  I read it as "these are all the opcodes we care
about".  When, in fact, I _think_ it's all the opcodes that don't have a
size in insn.opnd_bytes.

Could you spell that out, please?
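
If that reading is right, one way to spell it out is a small helper with
named opcodes, roughly as sketched here. The enum and function names are
hypothetical; the opcode values and their meanings are the ones from the
quoted switch statement.

```c
#include <stdint.h>

/* Hypothetical names for the byte-sized MOV forms listed in the patch. */
enum {
	OPC_MOV_RM8_R8   = 0x88,	/* MOV r/m8, r8   */
	OPC_MOV_R8_RM8   = 0x8A,	/* MOV r8, r/m8   */
	OPC_MOV_RM8_IMM8 = 0xC6,	/* MOV r/m8, imm8 */
};

/*
 * Hypothetical helper: these opcodes always access one byte, whatever
 * operand size the decoder reports. Everything else uses the decoded
 * operand size unchanged.
 */
static inline int mmio_access_size(uint8_t opcode, int opnd_bytes)
{
	switch (opcode) {
	case OPC_MOV_RM8_R8:
	case OPC_MOV_R8_RM8:
	case OPC_MOV_RM8_IMM8:
		return 1;
	default:
		return opnd_bytes;
	}
}
```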

> +		size = 1;
> +		break;
> +	}
> +
> +	if (inat_has_immediate(insn.attr)) {
> +		BUG_ON(!write);
> +		val = insn.immediate.value;

This is pretty interesting.  This won't work with implicit accesses.  I
guess the limited opcodes above limit how much imprecision will result.
But it would still be nice to hear something about that.

For instance, if someone pointed a mid-level page table to MMIO, we'd
get a va->gpa that had zero to do with the instruction.  Granted, that's
only going to happen if something bonkers is going on, but maybe I'm
missing some simpler cases of implicit accesses.

> +		tdg_mmio(size, write, ve->gpa, val);

What happens if this is an MMIO operation that *partially* touches MMIO
and partially touches normal memory?  Let's say I wrote two bytes
(0x1234), starting at the last byte of a RAM page that ran over into an
MMIO page.  The fault would occur trying to write 0x34 to the MMIO, but
the instruction cracking would result in trying to write 0x1234 into the
MMIO.

It doesn't seem *that* outlandish that an MMIO might cross a page
boundary.  Would this work for a two-byte MMIO that crosses a page?

> +		return insn.length;
> +	}
> +
> +	BUG_ON(!inat_has_modrm(insn.attr));

A comment would be nice here about the BUG_ON().

It would also be nice to give a high-level view of what's going on and
what we know about the instruction at this point.

> +	reg = get_reg_ptr(regs, &insn);
> +
> +	if (write) {
> +		memcpy(&val, reg, size);
> +		tdg_mmio(size, write, ve->gpa, val);
> +	} else {
> +		val = tdg_mmio(size, write, ve->gpa, val);
> +		memset(reg, 0, size);
> +		memcpy(reg, &val, size);
> +	}
> +	return insn.length;
> +}
> +
>  unsigned long tdg_get_ve_info(struct ve_info *ve)
>  {
>  	u64 ret;
> @@ -258,6 +355,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
>  	case EXIT_REASON_IO_INSTRUCTION:
>  		tdg_handle_io(regs, ve->exit_qual);
>  		break;
> +	case EXIT_REASON_EPT_VIOLATION:
> +		ve->instr_len = tdg_handle_mmio(regs, ve);
> +		break;
>  	default:
>  		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>  		return -EFAULT;
> 


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-18  0:09     ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-05-18 15:11       ` Dave Hansen
  2021-05-18 15:45         ` Andi Kleen
  2021-05-21 18:45       ` [RFC v2-fix " Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 15:11 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Sean Christopherson

On 5/17/21 5:09 PM, Kuppuswamy Sathyanarayanan wrote:
> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
> although we don't expect it to happen because we don't expect NMIs to
> trigger #VEs. Another case where they could happen is if the #VE
> exception panics, but in this case there are no guarantees on anything
> anyways.

This implies: "we do not expect any NMI to do MMIO".  Is that true?  Why?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-18 15:11       ` Dave Hansen
@ 2021-05-18 15:45         ` Andi Kleen
  2021-05-18 15:56           ` Dave Hansen
  2021-05-21 19:22           ` Dan Williams
  0 siblings, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 15:45 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson


On 5/18/2021 8:11 AM, Dave Hansen wrote:
> On 5/17/21 5:09 PM, Kuppuswamy Sathyanarayanan wrote:
>> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
>> although we don't expect it to happen because we don't expect NMIs to
>> trigger #VEs. Another case where they could happen is if the #VE
>> exception panics, but in this case there are no guarantees on anything
>> anyways.
> This implies: "we do not expect any NMI to do MMIO".  Is that true?  Why?

Only drivers that are not supported in TDX anyways could do it (mainly 
watchdog drivers)

Panic is an exception, but that has already been covered.

-Andi




^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18  0:15                 ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-05-18 15:51                   ` Dave Hansen
  2021-05-18 16:23                     ` Sean Christopherson
  2021-05-18 20:12                     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 15:51 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Isaku Yamahata

Question for KVM folks: Should all of these guest patches say:
"x86/tdx/guest:" or something?  It seems like that would put us all in
the right frame of mind as we review these.  It's kinda easy (for me at
least) to get lost about which side I'm looking at sometimes.

On 5/17/21 5:15 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> KVM hypercalls use the "vmcall" or "vmmcall" instructions.
> Although the ABI is similar, those instructions no longer
> function for TDX guests. Make vendor specififc TDVMCALLs

				"vendor-specific"

		    Hyphen and spelling ^

> instead of VMCALL.

This would also be a great place to say:

This enables TDX guests to run with KVM acting as the hypervisor.  TDX
guests running under other hypervisors will continue to use those
hypervisors' hypercalls.

> [Isaku: proposed KVM VENDOR string]
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

This SoB chain is odd.  Kirill wrote this, sent it to Isaku, who sent it
to Sathya?

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9e0e0ff76bab..768df1b98487 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -886,6 +886,12 @@ config INTEL_TDX_GUEST
>  	  run in a CPU mode that protects the confidentiality of TD memory
>  	  contents and the TD’s CPU state from other software, including VMM.
>  
> +config INTEL_TDX_GUEST_KVM
> +	def_bool y
> +	depends on KVM_GUEST && INTEL_TDX_GUEST
> +	help
> +	 This option enables KVM specific hypercalls in TDX guest.

For something that's not user-visible, I'd probably just add a Kconfig
comment rather than help text.

...
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 7966c10ea8d1..a90fec004844 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -128,6 +128,7 @@ obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
>  
>  obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
>  obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
> +obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o

Is the indentation consistent with the other items near "tdx-kvm.o" in
the Makefile?

...
> +/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
> +long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
> +		unsigned long p3, unsigned long p4)
> +{
> +	return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
> +}
> +EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);

I always forget that KVM code is goofy and needs to have things in C
files so you can export the symbols.  Could you add a sentence to the
changelog to this effect?

Code-wise, this is fine.  Just a few tweaks and I'll be happy to ack
this one.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-18 15:45         ` Andi Kleen
@ 2021-05-18 15:56           ` Dave Hansen
  2021-05-18 16:00             ` Andi Kleen
  2021-05-21 19:22           ` Dan Williams
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 15:56 UTC (permalink / raw)
  To: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson

On 5/18/21 8:45 AM, Andi Kleen wrote:
> 
> On 5/18/2021 8:11 AM, Dave Hansen wrote:
>> On 5/17/21 5:09 PM, Kuppuswamy Sathyanarayanan wrote:
>>> After TDGETVEINFO #VE could happen in theory (e.g. through an NMI),
>>> although we don't expect it to happen because we don't expect NMIs to
>>> trigger #VEs. Another case where they could happen is if the #VE
>>> exception panics, but in this case there are no guarantees on anything
>>> anyways.
>> This implies: "we do not expect any NMI to do MMIO".  Is that true?  Why?
> 
> Only drivers that are not supported in TDX anyways could do it (mainly
> watchdog drivers)

No APIC access either?

Also, shouldn't we have at least a:

	WARN_ON_ONCE(in_nmi());

if we don't expect (or handle well) #VE in NMIs?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 15:00       ` Dave Hansen
@ 2021-05-18 15:56         ` Andi Kleen
  2021-05-18 16:04           ` Dave Hansen
  2021-05-18 17:11           ` Sean Christopherson
  2021-05-18 16:18         ` Sean Christopherson
  1 sibling, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 15:56 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel


On 5/18/2021 8:00 AM, Dave Hansen wrote:
> On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> In traditional VMs, MMIO tends to be implemented by giving a
>> guest access to a mapping which will cause a VMEXIT on access.
>> That's not possible in TDX guest.
> Why is it not possible?

For one, the TDX module doesn't support uncached mappings (IgnorePAT is
always 1).




>
>> For now we only handle a subset of instructions that the kernel
>> uses for MMIO operations. User-space access triggers SIGBUS.
> How do you know which instructions the kernel uses?

They're all in MMIO macros.


>   How do you know
> that the compiler won't change them?

The macros try hard to prevent that because it would likely break real 
MMIO too.

Besides it works for others, like AMD-SEV today and of course all the 
hypervisors that do the same.




> That sounds like something objective we can measure.  Does this cost 1
> byte of extra text per readl/writel?  10?  100?

Alternatives are at least a pointer, but also the extra alternative 
code. It's definitely more than 10, I would guess 40+



>
> I thought there were more than a few ways that userspace could get
> access to MMIO mappings.

Yes and they will all fault in TDX guests.


>> +	if (user_mode(regs)) {
>> +		pr_err("Unexpected user-mode MMIO access.\n");
>> +		force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);
> 						       extra space ^
>
> Is a non-ratelimited pr_err() appropriate here?  I guess there shouldn't
> be any MMIO passthrough to userspace on these systems.
Yes rate limiting makes sense.


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/traps: Add #VE support for TDX guest
  2021-05-18 15:56           ` Dave Hansen
@ 2021-05-18 16:00             ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 16:00 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel,
	Sean Christopherson


> No APIC access either?


It's all X2APIC inside TDX which uses MSRs

>
> Also, shouldn't we have at least a:
>
> 	WARN_ON_ONCE(in_nmi());
>
> if we don't expect (or handle well) #VE in NMIs?

We handle it perfectly fine. It's just not needed.


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 15:56         ` Andi Kleen
@ 2021-05-18 16:04           ` Dave Hansen
  2021-05-18 16:10             ` Andi Kleen
  2021-05-18 17:11           ` Sean Christopherson
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 16:04 UTC (permalink / raw)
  To: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 5/18/21 8:56 AM, Andi Kleen wrote:
> On 5/18/2021 8:00 AM, Dave Hansen wrote:
>> On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>
>>> In traditional VMs, MMIO tends to be implemented by giving a
>>> guest access to a mapping which will cause a VMEXIT on access.
>>> That's not possible in TDX guest.
>> Why is it not possible?
> 
> For once the TDX module doesn't support uncached mappings (IgnorePAT is
> always 1)

Actually, I was thinking more along the lines of why the architecture
doesn't have VMEXITs:  VMEXITs expose guest state to the host and VMMs
use that state to emulate MMIO.  TDX guests don't trust the host and
can't have that arbitrary state exposed to the host.  So, they sanitize
the state in the #VE handler and make a *controlled* transition into the
host with a TDCALL rather than an uncontrolled VMEXIT.

>>> For now we only handle a subset of instructions that the kernel
>>> uses for MMIO operations. User-space access triggers SIGBUS.
>> How do you know which instructions the kernel uses?
> 
> They're all in MMIO macros.

I've heard exactly the opposite from the TDX team in the past.  What I
remember was a claim that one can not just leverage the MMIO macros as a
single point to avoid MMIO.  I remember being told that not all code in
the kernel that does MMIO uses these macros.  APIC MMIOs were called
out as a place that does not use the MMIO macros.

I'm confused now.

>>   How do you know that the compiler won't change them?
> 
> The macros try hard to prevent that because it would likely break real
> MMIO too.
> 
> Besides it works for others, like AMD-SEV today and of course all the
> hypervisors that do the same.

That would be some excellent information for the changelog.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 16:04           ` Dave Hansen
@ 2021-05-18 16:10             ` Andi Kleen
  2021-05-18 16:22               ` Dave Hansen
  2021-05-18 17:28               ` Andi Kleen
  0 siblings, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 16:10 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel


>>>> For now we only handle a subset of instructions that the kernel
>>>> uses for MMIO operations. User-space access triggers SIGBUS.
>>> How do you know which instructions the kernel uses?
>> They're all in MMIO macros.
> I've heard exactly the opposite from the TDX team in the past.  What I
> remember was a claim that one can not just leverage the MMIO macros as a
> single point to avoid MMIO.  I remember being told that not all code in
> the kernel that does MMIO uses these macros.  APIC MMIO's were called
> out as a place that does not use the MMIO macros.

Yes, x86 APIC has its own macros, but we don't use the MMIO-based APIC,
only X2APIC in TDX.

I'm not aware of any other places that would do MMIO without using the 
standard io.h macros, although it might happen in theory on x86 (but 
would likely break on some other architectures)


-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 15:00       ` Dave Hansen
  2021-05-18 15:56         ` Andi Kleen
@ 2021-05-18 16:18         ` Sean Christopherson
  2021-05-18 17:15           ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 16:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Tue, May 18, 2021, Dave Hansen wrote:
> On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > In traditional VMs, MMIO tends to be implemented by giving a
> > guest access to a mapping which will cause a VMEXIT on access.
> > That's not possible in TDX guest.
> 
> Why is it not possible?

It is possible, and in fact KVM will cause a VM-Exit on the first access to a
given MMIO page.  The problem is that guest state is inaccessible and so the VMM
cannot do the front end of MMIO instruction emulation.

> > So use #VE to implement MMIO support. In TDX guest, MMIO triggers #VE
> > with EPT_VIOLATION exit reason.

It's more accurate to say that the VMM will configure EPT entries for pages that
require instruction emulation to cause #VE.

> What does the #VE handler do to resolve the exception?
> 
> > For now we only handle a subset of instructions that the kernel
> > uses for MMIO operations. User-space access triggers SIGBUS.
> 
> How do you know which instructions the kernel uses?  How do you know
> that the compiler won't change them?
>
> I guess the kernel won't boot far if this happens, but this still sounds
> like trial-and-error programming.

If a driver accesses MMIO through a struct overlay, all bets are off.  The I/O
APIC code does this, but that problem is "solved" by forcefully disabling the
I/O APIC since it's useless for the current incarnation of TDX.  IIRC, some of
the console code also accesses MMIO via a struct (or maybe just through generic
C code), and the compiler does indeed employ a wider variety of instructions.

So yeah, whack-a-mole.
 
> > Also, reasons for supporting #VE based MMIO in TDX guest are,
> > 
> > * MMIO is widely used and we'll have more drivers in the future.
> 
> OK, but you've also made a big deal about having to go explicitly audit
> these drivers.  I would imagine converting these over to stop using MMIO
> would be _relatively_ minor compared 

For drivers that use the kernel's macros, converting them to use TDVMCALL
directly will be trivial and shouldn't even require any modifications to the
driver.  For drivers that use a struct overlay or generic C code, the "conversion"
could require a complete rewrite of the driver.

> to a big security audit and new fuzzing infrastructure.
> 
> > * We don't want to annotate every TDX specific MMIO readl/writel etc.
> 
> 				    ^ TDX-specific
> 
> > * If we didn't annotate we would need to add an alternative to every
> >   MMIO access in the kernel (even though 99.9% will never be used on
> >   TDX) which would be a complete waste and incredible binary bloat
> >   for nothing.
> 
> That sounds like something objective we can measure.  Does this cost 1
> byte of extra text per readl/writel?  10?  100?

Agreed.  And IMO, it's worth converting the common case (macros) if the overhead
is acceptable, while leaving the #VE handling in place for non-standard code.

> > +static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 
...

> > +		return 0;
> > +	}
> > +
> > +	kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
> > +	insn_get_length(&insn);
> > +	insn_get_opcode(&insn);
> > +
> > +	write = ve->exit_qual & 0x2;
> > +
> > +	size = insn.opnd_bytes;
> > +	switch (insn.opcode.bytes[0]) {
> > +	/* MOV r/m8	r8	*/
> > +	case 0x88:
> > +	/* MOV r8	r/m8	*/
> > +	case 0x8A:
> > +	/* MOV r/m8	imm8	*/
> > +	case 0xC6:
> 
> FWIW, I find that *REALLY* hard to read.

Why does this code exist at all?  TDX and SEV-ES absolutely must share code for
handling MMIO reflection.  It will require a fair amount of refactoring to move
the guts of vc_handle_mmio() to common code, but there is zero reason to maintain
two separate versions of the opcode cracking.

Ditto for string I/O in vc_handle_ioio().

> What happens if this is an MMIO operation that *partially* touches MMIO
> and partially touches normal memory?  Let's say I wrote two bytes
> (0x1234), starting at the last byte of a RAM page that ran over into an
> MMIO page.  The fault would occur trying to write 0x34 to the MMIO, but
> the instruction cracking would result in trying to write 0x1234 into the
> MMIO.
> 
> It doesn't seem *that* outlandish that an MMIO might cross a page
> boundary.  Would this work for a two-byte MMIO that crosses a page?

I'm pretty sure we can get away with panic (kernel) and SIGBUS (userspace) on
a reflected memory access that splits a page.  Yes, it's theoretically possible
and probably even "works", but practically speaking no emulated MMIO device is
going to have a single logical register/thing split a page, and I can't think of
any reason to allow accessing multiple registers/things across a page split.

The existing SEV-ES #VC handlers appear to be missing page split checks, so that
needs to be fixed.
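
For reference, the check itself is small. A minimal sketch, assuming 4 KiB
pages; access_splits_page() and SKETCH_PAGE_SIZE are hypothetical names, not
existing kernel symbols:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

/*
 * Return true if the access [addr, addr + size) touches more than one
 * page. Working on the in-page offset avoids overflow near the top of
 * the address space.
 */
static bool access_splits_page(uint64_t addr, size_t size)
{
	uint64_t offset = addr & (SKETCH_PAGE_SIZE - 1);

	return offset + size > SKETCH_PAGE_SIZE;
}
```

With Dave's example, a two-byte write starting at the last byte of a page
(offset 0xfff) is flagged, and the handler could then panic or send SIGBUS
instead of forwarding the full-width access.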

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 16:10             ` Andi Kleen
@ 2021-05-18 16:22               ` Dave Hansen
  2021-05-18 17:05                 ` Andi Kleen
  2021-05-18 17:28               ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 16:22 UTC (permalink / raw)
  To: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel

On 5/18/21 9:10 AM, Andi Kleen wrote:
> I'm not aware of any other places that would do MMIO without using the
> standard io.h macros, although it might happen in theory on x86 (but
> would likely break on some other architectures)

Can we please connect all of the dots and turn this into a coherent
changelog?

 * In-kernel MMIO is handled via exceptions (#VE) and instruction
   cracking
 * Arbitrary MMIO instructions are not handled (and would result in...)
 * The limited set of MMIO instructions that are handled are known and
   come from the io.h macros, ultimately build_mmio_read/write().
 * This approach is also used for SEV-ES???
 * Some x86 code that avoids the MMIO code is known to exist (APIC).
   But, this code is not used in TDX guests

BTW, in perusing arch/x86/include/asm/io.h, I was reminded of movdir64b.
That seems like one we'd want to take care of sooner rather than later.
Or, do we expect the first folks who expose a movdir64b-using driver to
TDX to go and update this code?

Also, the sev_key_active() stuff in there makes me nervous.  Does this
scheme work with these:

> static inline void outs##bwl(int port, const void *addr, unsigned long count) \
> static inline void ins##bwl(int port, void *addr, unsigned long count)  \

?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 15:51                   ` Dave Hansen
@ 2021-05-18 16:23                     ` Sean Christopherson
  2021-05-18 20:12                     ` Kuppuswamy, Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 16:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel, Isaku Yamahata

On Tue, May 18, 2021, Dave Hansen wrote:
> Question for KVM folks: Should all of these guest patches say:
> "x86/tdx/guest:" or something?

x86/tdx is fine.  The KVM convention is to use "KVM: xxx:" for KVM host code and
"x86/kvm" for KVM guest code.  E.g. for KVM TDX host code, the subjects will be
"KVM: x86:", "KVM: VMX:" or "KVM: TDX:".

The one I really don't like is using "tdg_" as the acronym for guest functions.
I find that really confusing and grep-unfriendly.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 16:22               ` Dave Hansen
@ 2021-05-18 17:05                 ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 17:05 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel


>   Or, do we expect the first folks who expose a movdir64b-using driver to
> TDX to go and update this code?

That's what we want to do.


>
> Also, the sev_key_active() stuff in there makes me nervous.  Does this
> scheme work with these:
>
>> static inline void outs##bwl(int port, const void *addr, unsigned long count) \
>> static inline void ins##bwl(int port, void *addr, unsigned long count)  \
> ?


This is not MMIO, but port IO. We do similar changes as AMD for TDX.


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 15:56         ` Andi Kleen
  2021-05-18 16:04           ` Dave Hansen
@ 2021-05-18 17:11           ` Sean Christopherson
  2021-05-18 17:21             ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 17:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Tue, May 18, 2021, Andi Kleen wrote:
> On 5/18/2021 8:00 AM, Dave Hansen wrote:
> > That sounds like something objective we can measure.  Does this cost 1
> > byte of extra text per readl/writel?  10?  100?
> 
> Alternatives are at least a pointer, but also the extra alternative code.
> It's definitely more than 10, I would guess 40+

The extra bytes for .altinstructions is very different than the extra bytes for
the code itself.  The .altinstructions section is freed after init, so yes it
bloats the kernel size a bit, but the runtime footprint is unaffected by the
patching metadata.

IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.

The other option to explore is to hook/patch IO_COND(), which can be done with
negligible overhead because the helpers that use IO_COND() are not inlined.  In a
TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
And if there are TDX VMMs that want to deploy virtio-mmio, hooking
drivers/virtio/virtio_mmio.c directly would be a viable option.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 16:18         ` Sean Christopherson
@ 2021-05-18 17:15           ` Andi Kleen
  2021-05-18 18:17             ` Sean Christopherson
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 17:15 UTC (permalink / raw)
  To: Sean Christopherson, Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel


>>> * If we didn't annotate we would need to add an alternative to every
>>>    MMIO access in the kernel (even though 99.9% will never be used on
>>>    TDX) which would be a complete waste and incredible binary bloat
>>>    for nothing.
>> That sounds like something objective we can measure.  Does this cost 1
>> byte of extra text per readl/writel?  10?  100?
> Agreed.  And IMO, it's worth converting the common case (macros) if the overhead
> is acceptable, while leaving the #VE handling in place for non-standard code.

We have many millions of lines of MMIO-using driver code in the kernel,
99.99% of which never runs in TDX. I don't see any point in impacting
everything for this. That would go against all good code change
hygiene practices, and would also just be bloat.

But we also don't want to touch every driver, for similar reasons.

What I think would make sense is to convert something to a direct TDCALL 
if we figure out the extra #VE is a real life performance problem. AFAIK 
the only candidate that I have in mind for this is the virtio doorbell 
write (and potentially later its VMBus equivalent). But we should really 
only do that if some measurements show it's needed.



> Why does this code exist at all?  TDX and SEV-ES absolutely must share code for
> handling MMIO reflection.  It will require a fair amount of refactoring to move
> the guts of vc_handle_mmio() to common code, but there is zero reason to maintain
> two separate versions of the opcode cracking.

While that's true on the high level, all the low level details are 
different. We looked at unifying at some point, but it would have been a 
callback hell. I don't think unifying would make anything cleaner.

Besides the bulk of the decoding work is already unified in the common 
x86 instruction decoder. The actual actions are different, and the code 
fetching is also different, so on the rest there isn't that much to unify.


> The existing SEV-ES #VC handlers appear to be missing page split checks, so that
> needs to be fixed.

Only if anyone in the kernel actually relies on it?


-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 17:11           ` Sean Christopherson
@ 2021-05-18 17:21             ` Andi Kleen
  2021-05-18 17:46               ` Dave Hansen
  2021-05-18 18:22               ` Sean Christopherson
  0 siblings, 2 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 17:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel


> The extra bytes for .altinstructions is very different than the extra bytes for
> the code itself.  The .altinstructions section is freed after init, so yes it
> bloats the kernel size a bit, but the runtime footprint is unaffected by the
> patching metadata.
>
> IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.
>
> The other option to explore is to hook/patch IO_COND(), which can be done with
> negligible overhead because the helpers that use IO_COND() are not inlined.  In a
> TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
> majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
> And if there are TDX VMMs that want to deploy virtio-mmio, hooking
> drivers/virtio/virtio_mmio.c directly would be a viable option.

Yes but what's the point of all that?

Even if it's only 3 bytes we still have a lot of MMIO all over the 
kernel which never needs it.

And I don't even see what TDX (or SEV which already does the decoding 
and has been merged) would get out of it. We handle all the #VEs just 
fine. And the instruction handling code is fairly straightforward too.

Besides instruction decoding works fine for all the existing 
hypervisors. All we really want to do is to do the same thing as KVM 
would do.

-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 16:10             ` Andi Kleen
  2021-05-18 16:22               ` Dave Hansen
@ 2021-05-18 17:28               ` Andi Kleen
  1 sibling, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 17:28 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, Sean Christopherson, linux-kernel


On 5/18/2021 9:10 AM, Andi Kleen wrote:
>
>>>>> For now we only handle a subset of instructions that the kernel
>>>>> uses for MMIO operations. User-space access triggers SIGBUS.
>>>> How do you know which instructions the kernel uses?
>>> They're all in MMIO macros.
>> I've heard exactly the opposite from the TDX team in the past. What I
>> remember was a claim that one can not just leverage the MMIO macros as a
>> single point to avoid MMIO.  I remember being told that not all code in
>> the kernel that does MMIO uses these macros.  APIC MMIO's were called
>> out as a place that does not use the MMIO macros.
>
> Yes x86 APIC has its own macros, but we don't use the MMIO based APIC, 
> only X2APIC in TDX.

I must correct myself here. We actually use #VE to handle MSRs, or at 
least those that are not context switched by the TDX module. So there 
can be #VE nested in NMI in normal operation, since MSR accesses in NMI 
can happen.

I don't think it needs any changes to the code -- this should all work 
-- but we need to update the commit log to document this case.


-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 17:21             ` Andi Kleen
@ 2021-05-18 17:46               ` Dave Hansen
  2021-05-18 18:36                 ` Sean Christopherson
  2021-05-18 20:20                 ` Andi Kleen
  2021-05-18 18:22               ` Sean Christopherson
  1 sibling, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 17:46 UTC (permalink / raw)
  To: Andi Kleen, Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

On 5/18/21 10:21 AM, Andi Kleen wrote:
> Besides instruction decoding works fine for all the existing
> hypervisors. All we really want to do is to do the same thing as KVM
> would do.

Dumb question of the day: If you want to do the same thing that KVM
does, why don't you share more code with KVM?  Wouldn't you, for
instance, need to crack the same instruction opcodes?

I'd feel a lot better about this if you said:

	Listen, this doesn't work for everything.  But, it will run
	every single driver as a TDX guest that KVM can handle as a
	host.  So, if the TDX code is broken, so is the KVM host code.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 17:15           ` Andi Kleen
@ 2021-05-18 18:17             ` Sean Christopherson
  2021-05-20 22:47               ` Kirill A. Shutemov
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 18:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Tue, May 18, 2021, Andi Kleen wrote:
> > Why does this code exist at all?  TDX and SEV-ES absolutely must share code for
> > handling MMIO reflection.  It will require a fair amount of refactoring to move
> > the guts of vc_handle_mmio() to common code, but there is zero reason to maintain
> > two separate versions of the opcode cracking.
> 
> While that's true on the high level, all the low level details are
> different. We looked at unifying at some point, but it would have been a
> callback hell. I don't think unifying would make anything cleaner.

How hard did you look?  The only part that _must_ be different between SEV and
TDX is the hypercall itself, which is wholly contained at the very end of
vc_do_mmio().

Despite vc_slow_virt_to_phys() taking a pointer to the ghcb, it's unused and
thus the function is 100% generic.

The ghcb->shared_buffer usage throughout the upper levels can be eliminated by
refactoring the stack to take a "u64 *val", since MMIO accesses are currently
bounded to 8 bytes.
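
A minimal sketch of that split, to show how little would need to differ. All
names here (mmio_hcall_fn, common_do_mmio(), fake_hcall()) are hypothetical
and the backend is a toy stand-in for the GHCB or TDVMCALL exit; this is not
the refactoring itself:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backend hook: the only platform-specific piece. */
typedef int (*mmio_hcall_fn)(uint64_t gpa, size_t size, bool write,
			     uint64_t *val);

/* Common part: validate the access, then hand off to the backend. */
static int common_do_mmio(mmio_hcall_fn hcall, uint64_t gpa, size_t size,
			  bool write, uint64_t *val)
{
	/* Accesses are bounded to 8 bytes, so one u64 carries the data. */
	if (size != 1 && size != 2 && size != 4 && size != 8)
		return -1;
	if (!write)
		*val = 0;
	return hcall(gpa, size, write, val);
}

/* Toy backend emulating a single device register. */
static uint64_t fake_reg;

static int fake_hcall(uint64_t gpa, size_t size, bool write, uint64_t *val)
{
	uint64_t mask = size == 8 ? ~0ULL : (1ULL << (size * 8)) - 1;

	(void)gpa;
	if (write)
		fake_reg = *val & mask;
	else
		*val = fake_reg & mask;
	return 0;
}
```

In this shape the SEV-ES and TDX versions would each supply their own
mmio_hcall_fn and nothing above it would need to know about the GHCB.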

> Besides the bulk of the decoding work is already unified in the common x86
> instruction decoder. The actual actions are different, and the code fetching
> is also different 

Huh?  What do you mean by "actual actions"?  Why is the code fetch different?

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 17:21             ` Andi Kleen
  2021-05-18 17:46               ` Dave Hansen
@ 2021-05-18 18:22               ` Sean Christopherson
  2021-05-18 20:28                 ` Andi Kleen
  1 sibling, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 18:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Tue, May 18, 2021, Andi Kleen wrote:
> 
> > The extra bytes for .altinstructions is very different than the extra bytes for
> > the code itself.  The .altinstructions section is freed after init, so yes it
> > bloats the kernel size a bit, but the runtime footprint is unaffected by the
> > patching metadata.
> > 
> > IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.
> > 
> > The other option to explore is to hook/patch IO_COND(), which can be done with
> > negligible overhead because the helpers that use IO_COND() are not inlined.  In a
> > TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
> > majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
> > And if there are TDX VMMs that want to deploy virtio-mmio, hooking
> > drivers/virtio/virtio_mmio.c directly would be a viable option.
> 
> Yes but what's the point of all that?

Patching IO_COND() is relatively low effort.  With some clever refactoring, I
suspect the net lines of code added would be less than 10.  That seems like a
worthwhile effort to avoid millions of faults over the lifetime of the guest.

> Even if it's only 3 bytes we still have a lot of MMIO all over the kernel
> which never needs it.
> 
> And I don't even see what TDX (or SEV which already does the decoding and
> has been merged) would get out of it. We handle all the #VEs just fine. And
> the instruction handling code is fairly straightforward too.
> 
> Besides instruction decoding works fine for all the existing hypervisors.
> All we really want to do is to do the same thing as KVM would do.

Heh, trust me, you don't want to do the same thing KVM does :-)

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 17:46               ` Dave Hansen
@ 2021-05-18 18:36                 ` Sean Christopherson
  2021-05-18 20:20                 ` Andi Kleen
  1 sibling, 0 replies; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 18:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Tue, May 18, 2021, Dave Hansen wrote:
> On 5/18/21 10:21 AM, Andi Kleen wrote:
> > Besides instruction decoding works fine for all the existing
> > hypervisors. All we really want to do is to do the same thing as KVM
> > would do.
> 
> Dumb question of the day: If you want to do the same thing that KVM
> does, why don't you share more code with KVM?  Wouldn't you, for
> instance, need to crack the same instruction opcodes?

Pulling in all of KVM's emulator is a bad idea from a security perspective.  That
could be mitigated to some extent by teaching the emulator to emulate only select
instructions, but it'd still be much higher risk than a barebones guest-specific
implementation.  Because old Intel CPUs don't support unrestricted guest, the set
of instructions that KVM _can_ emulate in total is far, far larger than what is
needed for MMIO.

Allowed instructions aside, KVM needs to handle a large number of things a TDX/SEV
guest does not, e.g. segmentation, CPUID model, A/D bit updates, and so on and
so forth.

Refactoring KVM's emulator would also be a monumental task.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Make DMA pages shared
  2021-05-18  1:19   ` [RFC v2-fix 1/1] " Kuppuswamy Sathyanarayanan
@ 2021-05-18 19:55     ` Sean Christopherson
  2021-05-18 22:12       ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 19:55 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel, Kai Huang,
	Sean Christopherson

On Mon, May 17, 2021, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Intel TDX doesn't allow VMM to access guest memory. Any memory
                                             ^
                                             |- private

And to be pedantic, the VMM can _access_ guest private memory all it wants, it
just can't decrypt guest private memory.

> that is required for communication with VMM must be shared
> explicitly by setting the bit in page table entry. And, after
> setting the shared bit, the conversion must be completed with
> MapGPA TDVMCALL. The call informs VMM about the conversion and
> makes it remove the GPA from the S-EPT mapping.

The VMM is _not_ required to remove the GPA from the S-EPT.  E.g. if the VMM
wants to, it can leave a 2mb private page intact and create a 4kb shared page
translation within the same range (ignoring the shared bit).

> The shared memory is similar to unencrypted memory in AMD SME/SEV
> terminology but the underlying process of sharing/un-sharing the memory is
> different for Intel TDX guest platform.
> 
> SEV assumes that I/O devices can only do DMA to "decrypted"
> physical addresses without the C-bit set.  In order for the CPU
> to interact with this memory, the CPU needs a decrypted mapping.
> To add this support, AMD SME code forces force_dma_unencrypted()
> to return true for platforms that support AMD SEV feature. It will
> be used for DMA memory allocation API to trigger
> set_memory_decrypted() for platforms that support AMD SEV feature.
> 
> TDX is similar.  TDX architecturally prevents access to private

TDX doesn't prevent accesses.  If hardware _prevented_ accesses then we wouldn't
have to deal with the #MC mess.

> guest memory by anything other than the guest itself. This means
> that any DMA buffers must be shared.
> 
> So create a new file mem_encrypt_tdx.c to hold TDX specific memory
> initialization code, and re-define force_dma_unencrypted() for
> TDX guest and make it return true to get DMA pages mapped as shared.
> 
> __set_memory_enc_dec() is now aware about TDX and sets Shared bit
> accordingly following with relevant TDVMCALL.
> 
> Also, Do TDACCEPTPAGE on every 4k page after mapping the GPA range when

This should call out that the current TDX spec only supports 4kb AUG/ACCEPT.

On that topic... are there plans to support 2mb and/or 1gb TDH.MEM.PAGE.AUG?  If
so, will TDG.MEM.PAGE.ACCEPT also support 2mb/1gb granularity?

> converting memory to private.  If the VMM uses a common pool for private
> and shared memory, it will likely do TDAUGPAGE in response to MAP_GPA
> (or on the first access to the private GPA),

What the VMM does or does not do is irrelevant.  What matters is what the VMM is
_allowed_ to do without violating the GHCI.  Specifically, the VMM is allowed to
unmap a private page in response to MAP_GPA to convert to a shared page.

  If the GPA (range) was already mapped as an active, private page, the host
  VMM may remove the private page from the TD by following the “Removing TD
  Private Pages” sequence in the Intel TDX-module specification [3] to safely
  block the mapping(s), flush the TLB and cache, and remove the mapping(s).

That would also provide a nice segue into the "already accepted" error below.

> in which case TDX-Module will hold the page in a non-present "pending" state
> until it is explicitly accepted.
> 
> BUG() if TDACCEPTPAGE fails (except the above case)

What above case?  The code handles the case where the page was already accepted,
but the changelog doesn't talk about that at all.  
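
To make the case the changelog should describe concrete, here is a minimal
sketch of a 4k-granular accept loop that tolerates an "already accepted"
result. The status codes and function names are hypothetical placeholders,
not the TDX module's real values or the patch's actual code:

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* current spec: 4k ACCEPT only */

/* Hypothetical status codes, for illustration only. */
enum accept_status { ACCEPT_OK = 0, ACCEPT_ALREADY = 1, ACCEPT_ERROR = 2 };

typedef enum accept_status (*accept_page_fn)(uint64_t gpa);

/*
 * Accept [start, end) one 4 KiB page at a time.
 * Returns 0 on success, -1 on a hard failure (where the patch would BUG()).
 */
static int accept_range_4k(uint64_t start, uint64_t end,
			   accept_page_fn accept)
{
	uint64_t gpa;

	for (gpa = start; gpa < end; gpa += SKETCH_PAGE_SIZE) {
		enum accept_status st = accept(gpa);

		/* A page that was accepted earlier is not an error. */
		if (st != ACCEPT_OK && st != ACCEPT_ALREADY)
			return -1;
	}
	return 0;
}

/* Toy callback: pretend the second page was accepted earlier. */
static unsigned int toy_calls;

static enum accept_status toy_accept(uint64_t gpa)
{
	toy_calls++;
	return gpa == SKETCH_PAGE_SIZE ? ACCEPT_ALREADY : ACCEPT_OK;
}
```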

> as the guest is completely hosed if it can't access memory. 


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 15:51                   ` Dave Hansen
  2021-05-18 16:23                     ` Sean Christopherson
@ 2021-05-18 20:12                     ` Kuppuswamy, Sathyanarayanan
  2021-05-18 20:19                       ` Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-18 20:12 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Isaku Yamahata



On 5/18/21 8:51 AM, Dave Hansen wrote:
> Question for KVM folks: Should all of these guest patches say:
> "x86/tdx/guest:" or something?  It seems like that would put us all in
> the right frame of mind as we review these.  It's kinda easy (for me at
> least) to get lost about which side I'm looking at sometimes.
> 
> On 5/17/21 5:15 PM, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> KVM hypercalls use the "vmcall" or "vmmcall" instructions.
>> Although the ABI is similar, those instructions no longer
>> function for TDX guests. Make vendor specififc TDVMCALLs
> 
> 				"vendor-specific"
> 
> 		    Hyphen and spelling ^

I will fix it in the next version.

> 
>> instead of VMCALL.
> 
> This would also be a great place to say:
> 
> This enables TDX guests to run with KVM acting as the hypervisor.  TDX
> guests running under other hypervisors will continue to use those
> hypervisors' hypercalls.

I will include it.

> 
>> [Isaku: proposed KVM VENDOR string]
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> This SoB chain is odd.  Kirill wrote this, sent it to Isaku, who sent it
> to Sathya?

Initially we used "0" as the vendor ID for KVM. Isaku then proposed a new
value for it and sent a patch to change it. I did not want to carry that as a
separate patch (for a one-line change), so I merged his change into this
patch and added his Signed-off-by with a comment ([Isaku: proposed KVM VENDOR string])

+#define TDVMCALL_VENDOR_KVM            0x4d564b2e584454 /* "TDX.KVM" */
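
For anyone checking the value: the constant is the ASCII string packed low
byte first. A small standalone sketch that unpacks it (vendor_id_to_str() is a
hypothetical helper for illustration, not part of the patch):

```c
#include <stddef.h>
#include <stdint.h>

#define TDVMCALL_VENDOR_KVM 0x4d564b2e584454ULL /* "TDX.KVM" */

/* Unpack a 64-bit vendor ID into a NUL-terminated string, low byte first. */
static void vendor_id_to_str(uint64_t id, char out[9])
{
	size_t i;

	for (i = 0; i < 8; i++)
		out[i] = (char)((id >> (8 * i)) & 0xff);
	out[8] = '\0';
}
```

The bytes come out as 0x54 'T', 0x44 'D', 0x58 'X', 0x2e '.', 0x4b 'K',
0x56 'V', 0x4d 'M', followed by a zero byte.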


> 
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 9e0e0ff76bab..768df1b98487 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -886,6 +886,12 @@ config INTEL_TDX_GUEST
>>   	  run in a CPU mode that protects the confidentiality of TD memory
>>   	  contents and the TD’s CPU state from other software, including VMM.
>>   
>> +config INTEL_TDX_GUEST_KVM
>> +	def_bool y
>> +	depends on KVM_GUEST && INTEL_TDX_GUEST
>> +	help
>> +	 This option enables KVM specific hypercalls in TDX guest.
> 
> For something that's not user-visible, I'd probably just add a Kconfig
> comment rather than help text.

If it is the preferred approach, I can remove it.

> 
> ...
>> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
>> index 7966c10ea8d1..a90fec004844 100644
>> --- a/arch/x86/kernel/Makefile
>> +++ b/arch/x86/kernel/Makefile
>> @@ -128,6 +128,7 @@ obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
>>   
>>   obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
>>   obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
>> +obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o
> 
> Is the indentation consistent with the other items near "tdx-kvm.o" in
> the Makefile?

Yes. For longer config names, common indentation is not maintained. Please
check the PMEM example.

obj-$(CONFIG_PARAVIRT_CLOCK)    += pvclock.o
obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o

obj-$(CONFIG_JAILHOUSE_GUEST)   += jailhouse.o
obj-$(CONFIG_INTEL_TDX_GUEST)   += tdcall.o tdx.o
obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o


> 
> ...
>> +/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
>> +long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
>> +		unsigned long p3, unsigned long p4)
>> +{
>> +	return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
>> +}
>> +EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);
> 
> I always forget that KVM code is goofy and needs to have things in C
> files so you can export the symbols.  Could you add a sentence to the
> changelog to this effect?
> 
> Code-wise, this is fine.  Just a few tweaks and I'll be happy to ack
> this one.

Will add it.

     Since KVM hypercall functions can be included and called
     from kernel modules, export tdx_kvm_hypercall*() functions
     to avoid symbol errors


> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 20:12                     ` Kuppuswamy, Sathyanarayanan
@ 2021-05-18 20:19                       ` Dave Hansen
  2021-05-18 20:57                         ` Kuppuswamy, Sathyanarayanan
  2021-05-18 21:19                         ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
  0 siblings, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 20:19 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Isaku Yamahata

On 5/18/21 1:12 PM, Kuppuswamy, Sathyanarayanan wrote:
>>> [Isaku: proposed KVM VENDOR string]
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>>> Signed-off-by: Kuppuswamy Sathyanarayanan
>>> <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
>> This SoB chain is odd.  Kirill wrote this, sent it to Isaku, who sent it
>> to Sathya?
> 
> Initially we have used "0" as vendor ID for KVM. But Isaku proposed a new
> value for it and sent a patch to fix it. But, I did not want to carry it as
> separate patch (for one line change). So I have merged his change with
> this patch, and added his signed-off with comment ([Isaku: proposed KVM
> VENDOR string])
> 
> +#define TDVMCALL_VENDOR_KVM            0x4d564b2e584454 /* "TDX.KVM" */

That's a combined Co-developed-by+Signed-off-by situation.  You don't
add a bare SoB for that.

But, seriously, you don't need to preserve a SoB for a one-line patch.
Just pull the line in and make a note in the changelog.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 17:46               ` Dave Hansen
  2021-05-18 18:36                 ` Sean Christopherson
@ 2021-05-18 20:20                 ` Andi Kleen
  2021-05-18 20:40                   ` Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 20:20 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel


On 5/18/2021 10:46 AM, Dave Hansen wrote:
> On 5/18/21 10:21 AM, Andi Kleen wrote:
>> Besides instruction decoding works fine for all the existing
>> hypervisors. All we really want to do is to do the same thing as KVM
>> would do.
> Dumb question of the day: If you want to do the same thing that KVM
> does, why don't you share more code with KVM?  Wouldn't you, for
> instance, need to crack the same instruction opcodes?

We're talking about ~60 lines of code that call an established
standard library.

https://github.com/intel/tdx/blob/8c20c364d1f52e432181d142054b1c2efa0ae6d3/arch/x86/kernel/tdx.c#L490

You're proposing a gigantic refactoring to avoid 60 lines of
straightforward code.

That's not a practical proposal.

>
> I'd feel a lot better about this if you said:
>
> 	Listen, this doesn't work for everything.  But, it will run
> 	every single driver as a TDX guest that KVM can handle as a
> 	host.  So, if the TDX code is broken, so is the KVM host code.

I don't really know what problem you're trying to solve here. We only 
have a small number of drivers and we tested them and they work fine. 
There are special macros that limit the number of instructions. If there 
are ever more instructions and the macros break somehow we'll add them. 
There will be a clean error if it ever happens. We're not trying to 
solve hypothetical problems here.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 18:22               ` Sean Christopherson
@ 2021-05-18 20:28                 ` Andi Kleen
  2021-05-18 20:37                   ` Sean Christopherson
  0 siblings, 1 reply; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 20:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel


On 5/18/2021 11:22 AM, Sean Christopherson wrote:
> On Tue, May 18, 2021, Andi Kleen wrote:
>>> The extra bytes for .altinstructions is very different than the extra bytes for
>>> the code itself.  The .altinstructions section is freed after init, so yes it
>>> bloats the kernel size a bit, but the runtime footprint is unaffected by the
>>> patching metadata.
>>>
>>> IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.
>>>
>>> The other option to explore is to hook/patch IO_COND(), which can be done with
>>> negligible overhead because the helpers that use IO_COND() are not inlined.  In a
>>> TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
>>> majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
>>> And if there are TDX VMMs that want to deploy virtio-mmio, hooking
>>> drivers/virtio/virtio_mmio.c directly would be a viable option.
>> Yes but what's the point of all that?
> Patching IO_COND() is relatively low effort.  With some clever refactoring, I
> suspect the net lines of code added would be less than 10.  That seems like a
> worthwhile effort to avoid millions of faults over the lifetime of the guest.

AFAIK IO_COND is only for iomap users. But most drivers don't even use 
iomap. virtio doesn't for example, and that's really the only case we 
currently care about.

Also millions of faults is nothing for a CPU.

The only case I can see it making sense is the virtio (and vmbus) door 
bells. Everything else should be slow path anyways.

But doing that now would be premature optimization and that's usually a 
bad idea. If it's a problem we can fix it later.


>
>> Even if it's only 3 bytes we still have a lot of MMIO all over the kernel
>> which never needs it.
>>
>> And I don't even see what TDX (or SEV which already does the decoding and
>> has been merged) would get out of it. We handle all the #VEs just fine. And
>> the instruction handling code is fairly straight forward too.
>>
>> Besides instruction decoding works fine for all the existing hypervisors.
>> All we really want to do is to do the same thing as KVM would do.
> Heh, trust me, you don't want to do the same thing KVM does :-)

We want the same behavior.

Yes probably not the same code.


-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 20:28                 ` Andi Kleen
@ 2021-05-18 20:37                   ` Sean Christopherson
  2021-05-18 20:56                     ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Sean Christopherson @ 2021-05-18 20:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel

On Tue, May 18, 2021, Andi Kleen wrote:
> 
> On 5/18/2021 11:22 AM, Sean Christopherson wrote:
> > On Tue, May 18, 2021, Andi Kleen wrote:
> > > > The extra bytes for .altinstructions is very different than the extra bytes for
> > > > the code itself.  The .altinstructions section is freed after init, so yes it
> > > > bloats the kernel size a bit, but the runtime footprint is unaffected by the
> > > > patching metadata.
> > > > 
> > > > IIRC, patching read/write{b,w,l,q}() can be done with 3 bytes of .text overhead.
> > > > 
> > > > The other option to explore is to hook/patch IO_COND(), which can be done with
> > > > negligible overhead because the helpers that use IO_COND() are not inlined.  In a
> > > > TDX guest, redirecting IO_COND() to a paravirt helper would likely cover the
> > > > majority of IO/MMIO since virtio-pci exclusively uses the IO_COND() wrappers.
> > > > And if there are TDX VMMs that want to deploy virtio-mmio, hooking
> > > > drivers/virtio/virtio_mmio.c directly would be a viable option.
> > > Yes but what's the point of all that?
> > Patching IO_COND() is relatively low effort.  With some clever refactoring, I
> > suspect the net lines of code added would be less than 10.  That seems like a
> > worthwhile effort to avoid millions of faults over the lifetime of the guest.
> 
> AFAIK IO_COND is only for iomap users. But most drivers don't even use
> iomap. virtio doesn't for example, and that's really the only case we
> currently care about.

virtio-pci, which is going to be used by pretty much all traditional VMs, uses iomap.
See vp_get(), vp_set(), and all the vp_io{read,write}*() wrappers.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 20:20                 ` Andi Kleen
@ 2021-05-18 20:40                   ` Dave Hansen
  2021-05-18 21:05                     ` Andi Kleen
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 20:40 UTC (permalink / raw)
  To: Andi Kleen, Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel

On 5/18/21 1:20 PM, Andi Kleen wrote:
> 
> On 5/18/2021 10:46 AM, Dave Hansen wrote:
>> On 5/18/21 10:21 AM, Andi Kleen wrote:
>>> Besides instruction decoding works fine for all the existing
>>> hypervisors. All we really want to do is to do the same thing as KVM
>>> would do.
>> Dumb question of the day: If you want to do the same thing that KVM
>> does, why don't you share more code with KVM?  Wouldn't you, for
>> instance, need to crack the same instruction opcodes?
> 
> We're talking about ~60 lines of code that call an established
> standard library.
> 
> https://github.com/intel/tdx/blob/8c20c364d1f52e432181d142054b1c2efa0ae6d3/arch/x86/kernel/tdx.c#L490
> 
> You're proposing a gigantic refactoring to avoid 60 lines of straight
> forward code.
> 
> That's not a practical proposal.

Hi Andi,

I'm not actually trying to propose things.  I'm really just trying to
get an idea why the implementation ended up how it did.  I actually
entirely respect the position that the KVM code is a monster and
shouldn't get reused.  That seems totally reasonable.

What isn't reasonable is the lack of documentation of these design
decisions in the changelogs.  My goal here is to raise the quality of
the changelogs so that other reviewers and maintainers don't have to ask
these questions when they perform their reviews.

This is honestly the best way I know to help get this code merged as
soon as possible.  If I'm not helping, please let me know.  I'm happy to
spend my time elsewhere.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 20:37                   ` Sean Christopherson
@ 2021-05-18 20:56                     ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 20:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, Kuppuswamy Sathyanarayanan, Peter Zijlstra,
	Andy Lutomirski, Tony Luck, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	linux-kernel



> virtio-pci, which is going to be used by pretty much all traditional VMs, uses iomap.
> See vp_get(), vp_set(), and all the vp_io{read,write}*() wrappers.

That's true. But there are still all the other users. So it doesn't 
solve the problem. In the end I'm fairly sure we would need to patch 
readl/writel and friends.

-Andi


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 20:19                       ` Dave Hansen
@ 2021-05-18 20:57                         ` Kuppuswamy, Sathyanarayanan
  2021-05-18 21:19                         ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-18 20:57 UTC (permalink / raw)
  To: Dave Hansen, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Isaku Yamahata



On 5/18/21 1:19 PM, Dave Hansen wrote:
> But, seriously, you don't need to preserve a SoB for a one-line patch.
> Just pull the line in and make a note in the changelog.

Ok. Makes sense. I will leave the comment and remove SOB from Isaku.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO
  2021-05-18 20:40                   ` Dave Hansen
@ 2021-05-18 21:05                     ` Andi Kleen
  0 siblings, 0 replies; 381+ messages in thread
From: Andi Kleen @ 2021-05-18 21:05 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson
  Cc: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski,
	Tony Luck, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel


> I'm not actually trying to propose things.  I'm really just trying to
> get an idea why the implementation ended up how it did.  I actually
> entirely respect the position that the KVM code is a monster and
> shouldn't get reused.  That seems totally reasonable.

Mainly because it's relatively simple and straightforward to do it this
way. Yes, I know, that's a shocking concept, but sometimes it works even
in Linux code.

>
> What isn't reasonable is the lack of documentation of these design
> decisions in the changelogs.  My goal here is to raise the quality of
> the changelogs so that other reviewers and maintainers don't have to ask
> these questions when they perform their reviews.
>
> This is honestly the best way I know to help get this code merged as
> soon as possible.  If I'm not helping, please let me know.  I'm happy to
> spend my time elsewhere.

I'm sure the commit logs can be improved and I appreciate your feedback.


I don't think every commit log needs to be an extended essay meandering 
all over the possible design space, talking about everything that could 
have been and wasn't. The way code is normally written is that we don't 
do an exhaustive search of possible options, but instead we pick a 
reasonable path and as long as that works and doesn't have too many 
problems we just stick to it. The commit log reflects that single path 
chosen, with only rare exceptions to talk about dead alleys.

In this case you can even see that multiple independent efforts (AMD and 
Intel) came mostly to fairly similar implementations, so the path chosen 
wasn't really that strange or non obvious.

Also overall I would appreciate if people would focus more on the code 
than the commit logs. Commit logs are important, but in the end what 
really matters is that the code is correct.

-Andi



^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v2 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 20:19                       ` Dave Hansen
  2021-05-18 20:57                         ` Kuppuswamy, Sathyanarayanan
@ 2021-05-18 21:19                         ` Kuppuswamy Sathyanarayanan
  2021-05-18 23:29                           ` Dave Hansen
  1 sibling, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-18 21:19 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

KVM hypercalls use the "vmcall" or "vmmcall" instructions.
Although the ABI is similar, those instructions no longer
function for TDX guests. Make vendor-specific TDVMCALLs
instead of VMCALL. This enables TDX guests to run with KVM
acting as the hypervisor. TDX guests running under other
hypervisors will continue to use those hypervisor's
hypercalls.

Since KVM hypercall functions can be included and called
from kernel modules, export tdx_kvm_hypercall*() functions
to avoid symbol errors

[Isaku Yamahata: proposed KVM VENDOR string]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2-fix:
 * Removed "user help" for INTEL_TDX_GUEST_KVM config option
   and added a comment for it.
 * Added details about exporting symbols in the commit log.
 * Removed Isaku's sign-off.

Changes since RFC v2: 
 * Introduced INTEL_TDX_GUEST_KVM config for TDX+KVM related changes.
 * Removed "C" include file.
 * Fixed commit log as per Dave's comments.

 arch/x86/Kconfig                |  5 ++++
 arch/x86/include/asm/kvm_para.h | 21 +++++++++++++++
 arch/x86/include/asm/tdx.h      | 41 ++++++++++++++++++++++++++++
 arch/x86/kernel/Makefile        |  1 +
 arch/x86/kernel/tdcall.S        | 20 ++++++++++++++
 arch/x86/kernel/tdx-kvm.c       | 48 +++++++++++++++++++++++++++++++++
 6 files changed, 136 insertions(+)
 create mode 100644 arch/x86/kernel/tdx-kvm.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9e0e0ff76bab..15e66a99dd41 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,6 +886,11 @@ config INTEL_TDX_GUEST
 	  run in a CPU mode that protects the confidentiality of TD memory
 	  contents and the TD’s CPU state from other software, including VMM.
 
+# This option enables KVM specific hypercalls in TDX guest.
+config INTEL_TDX_GUEST_KVM
+	def_bool y
+	depends on KVM_GUEST && INTEL_TDX_GUEST
+
 endif #HYPERVISOR_GUEST
 
 source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
 #include <asm/alternative.h>
 #include <linux/interrupt.h>
 #include <uapi/asm/kvm_para.h>
+#include <asm/tdx.h>
 
 extern void kvmclock_init(void);
 
@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 static inline long kvm_hypercall0(unsigned int nr)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall0(nr);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall1(nr, p1);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
 				  unsigned long p2)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall2(nr, p1, p2);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
 				  unsigned long p2, unsigned long p3)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 				  unsigned long p4)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8ab4067afefc..eb758b506dba 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -73,4 +73,45 @@ static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
+			       u64 r15, struct tdx_hypercall_output *out);
+long tdx_kvm_hypercall0(unsigned int nr);
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1);
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2);
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3);
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4);
+#else
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+				      unsigned long p2)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3,
+				      unsigned long p4)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
+
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 7966c10ea8d1..a90fec004844 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -128,6 +128,7 @@ obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
 obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST_KVM) += tdx-kvm.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index a484c4aef6e6..3c57a1d67b79 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -25,6 +25,8 @@
 					  TDG_R12 | TDG_R13 | \
 					  TDG_R14 | TDG_R15 )
 
+#define TDVMCALL_VENDOR_KVM		0x4d564b2e584454 /* "TDX.KVM" */
+
 /*
  * TDX guests use the TDCALL instruction to make requests to the
  * TDX module and hypercalls to the VMM. It is supported in
@@ -213,3 +215,21 @@ SYM_FUNC_START(__tdx_hypercall)
 	call do_tdx_hypercall
 	retq
 SYM_FUNC_END(__tdx_hypercall)
+
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+/*
+ * Helper function for KVM vendor TDVMCALLs. This assembly wrapper
+ * lets us reuse do_tdx_hypercall() for KVM-specific hypercalls
+ * (TDVMCALL_VENDOR_KVM).
+ */
+SYM_FUNC_START(__tdx_hypercall_vendor_kvm)
+	/*
+	 * R10 is not part of the function call ABI, but it is a part
+	 * of the TDVMCALL ABI. So set it before making call to the
+	 * do_tdx_hypercall().
+	 */
+	movq $TDVMCALL_VENDOR_KVM, %r10
+	call do_tdx_hypercall
+	retq
+SYM_FUNC_END(__tdx_hypercall_vendor_kvm)
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
diff --git a/arch/x86/kernel/tdx-kvm.c b/arch/x86/kernel/tdx-kvm.c
new file mode 100644
index 000000000000..b21453a81e38
--- /dev/null
+++ b/arch/x86/kernel/tdx-kvm.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2020 Intel Corporation */
+
+#include <asm/tdx.h>
+
+static long tdx_kvm_hypercall(unsigned int fn, unsigned long r12,
+			      unsigned long r13, unsigned long r14,
+			      unsigned long r15)
+{
+	return __tdx_hypercall_vendor_kvm(fn, r12, r13, r14, r15, NULL);
+}
+
+/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall0);
+
+/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return tdx_kvm_hypercall(nr, p1, 0, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall1);
+
+/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2)
+{
+	return tdx_kvm_hypercall(nr, p1, p2, 0, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall2);
+
+/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3)
+{
+	return tdx_kvm_hypercall(nr, p1, p2, p3, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall3);
+
+/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
+long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2,
+		unsigned long p3, unsigned long p4)
+{
+	return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall4);
-- 
2.25.1
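
The "TDX.KVM" comment next to TDVMCALL_VENDOR_KVM in the patch above can be
checked by unpacking the constant one byte at a time, least significant byte
first. The following is only a userspace sketch to illustrate that encoding;
the helper name vendor_id_to_str() is hypothetical and is not part of the
patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Value copied from the patch; the ULL suffix is added for userspace C. */
#define TDVMCALL_VENDOR_KVM	0x4d564b2e584454ULL /* "TDX.KVM" */

/*
 * Hypothetical helper (not in the patch): unpack a 64-bit vendor ID
 * into a NUL-terminated string, least significant byte first.
 * buf must have room for 8 characters plus the terminator.
 */
static void vendor_id_to_str(uint64_t id, char buf[9])
{
	size_t i;

	for (i = 0; i < 8; i++)
		buf[i] = (char)((id >> (8 * i)) & 0xff);
	buf[8] = '\0';
}
```

Since the top byte of the constant is zero, the resulting string ends after
seven characters.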


^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Make DMA pages shared
  2021-05-18 19:55     ` Sean Christopherson
@ 2021-05-18 22:12       ` Kuppuswamy, Sathyanarayanan
  2021-05-18 22:31         ` Dave Hansen
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-18 22:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Dan Williams, Raj Ashok, linux-kernel, Kai Huang,
	Sean Christopherson



On 5/18/21 12:55 PM, Sean Christopherson wrote:
> On Mon, May 17, 2021, Kuppuswamy Sathyanarayanan wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> Intel TDX doesn't allow VMM to access guest memory. Any memory
>                                               ^
>                                               |- private
> 
> And to be pedantic, the VMM can _access_ guest private memory all it wants, it
> just can't decrypt guest private memory.

Ok. I will use "guest private memory".

> 
>> that is required for communication with VMM must be shared
>> explicitly by setting the bit in page table entry. And, after
>> setting the shared bit, the conversion must be completed with
>> MapGPA TDVMCALL. The call informs VMM about the conversion and
>> makes it remove the GPA from the S-EPT mapping.
> 
> The VMM is _not_ required to remove the GPA from the S-EPT.  E.g. if the VMM
> wants to, it can leave a 2mb private page intact and create a 4kb shared page
> translation within the same range (ignoring the shared bit).

So would removing "makes it remove the GPA from the S-EPT mapping"
be sufficient? Or do you want me to add more detail?


> 
>> The shared memory is similar to unencrypted memory in AMD SME/SEV
>> terminology but the underlying process of sharing/un-sharing the memory is
>> different for Intel TDX guest platform.
>>
>> SEV assumes that I/O devices can only do DMA to "decrypted"
>> physical addresses without the C-bit set.  In order for the CPU
>> to interact with this memory, the CPU needs a decrypted mapping.
>> To add this support, AMD SME code forces force_dma_unencrypted()
>> to return true for platforms that support AMD SEV feature. It will
>> be used for DMA memory allocation API to trigger
>> set_memory_decrypted() for platforms that support AMD SEV feature.
>>
>> TDX is similar.  TDX architecturally prevents access to private
> 
> TDX doesn't prevent accesses.  If hardware _prevented_ accesses then we wouldn't
> have to deal with the #MC mess.
How about the following change?

"TDX is similar. TDX architecturally prevents access to private guest memory by
  anything other than the guest itself. This means that any DMA buffers must be
  shared."

modified to =>

"TDX is similar. In TDX architecture, the private guest memory is encrypted, which
prevents anything other than guest from accessing/modifying it. So to communicate
with I/O devices, we need to create decrypted mapping and make the pages shared."

> 
>> guest memory by anything other than the guest itself. This means
>> that any DMA buffers must be shared.
>>
>> So create a new file mem_encrypt_tdx.c to hold TDX specific memory
>> initialization code, and re-define force_dma_unencrypted() for
>> TDX guest and make it return true to get DMA pages mapped as shared.
>>
>> __set_memory_enc_dec() is now aware about TDX and sets Shared bit
>> accordingly following with relevant TDVMCALL.
>>
>> Also, Do TDACCEPTPAGE on every 4k page after mapping the GPA range when
> 
> This should call out that the current TDX spec only supports 4kb AUG/ACCEPT.

Ok. I will add this spec detail.

> 
> On that topic... are there plans to support 2mb and/or 1gb TDH.MEM.PAGE.AUG?  If
> so, will TDG.MEM.PAGE.ACCEPT also support 2mb/1gb granularity?
> 
>> converting memory to private.  If the VMM uses a common pool for private
>> and shared memory, it will likely do TDAUGPAGE in response to MAP_GPA
>> (or on the first access to the private GPA),
> 
> What the VMM does or does not do is irrelevant.  What matters is what the VMM is
> _allowed_ to do without violating the GHCI.  Specifically, the VMM is allowed to
> unmap a private page in response to MAP_GPA to convert to a shared page.
> 
>    If the GPA (range) was already mapped as an active, private page, the host
>    VMM may remove the private page from the TD by following the “Removing TD
>    Private Pages” sequence in the Intel TDX-module specification [3] to safely
>    block the mapping(s), flush the TLB and cache, and remove the mapping(s).
> 
> That would also provide a nice segue into the "already accepted" error below.

Ok. I will add the above detail.

> 
>> in which case TDX-Module will hold the page in a non-present "pending" state
>> until it is explicitly accepted.
>>
>> BUG() if TDACCEPTPAGE fails (except the above case)
> 
> What above case?  The code handles the case where the page was already accepted,
> but the changelog doesn't talk about that at all.

I think it refers to the "already accepted" page case. With your above suggestion,
we can ignore this error. Or I can change it to,

BUG() if TDACCEPTPAGE fails (except "previously accepted page" case)

> 
>> as the guest is completely hosed if it can't access memory.
> 
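
For reference, the behavior discussed above (accept every 4k page in the
converted range, tolerate "already accepted", treat anything else as fatal)
could be sketched roughly as below. This is a userspace illustration only:
the status codes, accept_range() and mock_accept() are all hypothetical
names, and the real code would issue the accept TDCALL and BUG() rather than
return an error.

```c
#include <stdint.h>

#define PAGE_SIZE_4K	4096ULL

/* Hypothetical status codes; the real values come from the TDX module. */
enum accept_status {
	ACCEPT_OK = 0,
	ACCEPT_ALREADY_ACCEPTED,
	ACCEPT_ERROR,
};

/* Hypothetical callback type standing in for the per-page accept call. */
typedef enum accept_status (*accept_fn)(uint64_t gpa);

/*
 * Accept [start, end) one 4k page at a time. A page that was already
 * accepted is not an error. Returns 0 on success and -1 on the first
 * hard failure, which is where the kernel code would BUG().
 */
static int accept_range(uint64_t start, uint64_t end, accept_fn accept)
{
	uint64_t gpa;

	for (gpa = start; gpa < end; gpa += PAGE_SIZE_4K) {
		enum accept_status st = accept(gpa);

		if (st != ACCEPT_OK && st != ACCEPT_ALREADY_ACCEPTED)
			return -1;
	}

	return 0;
}

/* Mock for illustration: one page is pre-accepted, high GPAs fail. */
static enum accept_status mock_accept(uint64_t gpa)
{
	if (gpa == 0x1000)
		return ACCEPT_ALREADY_ACCEPTED;
	if (gpa >= 0x10000)
		return ACCEPT_ERROR;
	return ACCEPT_OK;
}
```

With the mock, a range that only contains accepted or previously accepted
pages succeeds, while a range that reaches a failing page stops at that page.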

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix 1/1] x86/tdx: Make DMA pages shared
  2021-05-18 22:12       ` Kuppuswamy, Sathyanarayanan
@ 2021-05-18 22:31         ` Dave Hansen
  2021-06-01  2:06           ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 22:31 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan, Sean Christopherson
  Cc: Peter Zijlstra, Andy Lutomirski, Tony Luck, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan, Dan Williams,
	Raj Ashok, linux-kernel, Kai Huang, Sean Christopherson

On 5/18/21 3:12 PM, Kuppuswamy, Sathyanarayanan wrote:
> "TDX is similar. In TDX architecture, the private guest memory is 
> encrypted, which prevents anything other than guest from
> accessing/modifying it. So to communicate with I/O devices, we need
> to create decrypted mapping and make the pages shared."

That's actually even more wrong. :(

Check out "Machine Check Architecture Background" in the TDX
architecture spec.

Modification is totally permitted in the architecture.  A host can write
all day long to guest memory.  Depending on how you use the word,
"access" can also include writes.

TDX really just prevents guests from *consuming* the gunk that an
attacker might write.

Also, don't say "decrypted".  The memory is probably still TME-enabled
and probably encrypted on the DIMM.  It's still encrypted even if
shared, it's just using the TME key, not the TD key.

^ permalink raw reply	[flat|nested] 381+ messages in thread

* Re: [RFC v2-fix-v2 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 21:19                         ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
@ 2021-05-18 23:29                           ` Dave Hansen
  2021-05-19  1:17                             ` [RFC v2-fix-v3 " Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 381+ messages in thread
From: Dave Hansen @ 2021-05-18 23:29 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

On 5/18/21 2:19 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> KVM hypercalls use the "vmcall" or "vmmcall" instructions.
> Although the ABI is similar, those instructions no longer
> function for TDX guests. Make vendor-specific TDVMCALLs
> instead of VMCALL. This enables TDX guests to run with KVM
> acting as the hypervisor. TDX guests running under other
> hypervisors will continue to use those hypervisor's
> hypercalls.

Well, I screwed this up when I typed it too, but it is:

	TDX guests running under other hypervisors will continue
	to use those hypervisors' hypercalls.

I hate how that reads, but oh well.

> Since KVM hypercall functions can be included and called
> from kernel modules, export tdx_kvm_hypercall*() functions
> to avoid symbol errors

No, you're not avoiding errors, you're exporting the symbol so it can be
*USED*.  The error comes from it not being exported.

It also helps to be specific here:  Export tdx_kvm_hypercall*() to make
the symbols visible to kvm.ko.

> [Isaku Yamahata: proposed KVM VENDOR string]
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Reviewed-by: Dave Hansen <dave.hansen@intel.com>

Also, FWIW, if you did this in the header:

+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
+}

You could get away with just exporting tdx_kvm_hypercall() instead of 4
symbols.  The rest of the code would look the same.
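
A slightly fuller sketch of that idea, for illustration only: one out-of-line
function carries the export, and the numbered helpers become static inline
wrappers in the header. The body of tdx_kvm_hypercall() below is a userspace
stand-in that just records its arguments (struct last_call is hypothetical);
in the kernel it would call __tdx_hypercall_vendor_kvm() and be followed by
EXPORT_SYMBOL_GPL(tdx_kvm_hypercall).

```c
/* Hypothetical bookkeeping so the wrappers can be exercised in userspace. */
struct last_call {
	unsigned int nr;
	unsigned long p[4];
};

static struct last_call last;

/* Stand-in for the single exported function. */
long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
		       unsigned long p3, unsigned long p4)
{
	last.nr = nr;
	last.p[0] = p1;
	last.p[1] = p2;
	last.p[2] = p3;
	last.p[3] = p4;
	return 0;
}

/* Header-side wrappers: unused parameters are passed as zero. */
static inline long tdx_kvm_hypercall0(unsigned int nr)
{
	return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
}

static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
{
	return tdx_kvm_hypercall(nr, p1, 0, 0, 0);
}

static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
				      unsigned long p2)
{
	return tdx_kvm_hypercall(nr, p1, p2, 0, 0);
}
```

The 3- and 4-argument wrappers would follow the same pattern.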

^ permalink raw reply	[flat|nested] 381+ messages in thread

* [RFC v2-fix-v3 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-18 23:29                           ` Dave Hansen
@ 2021-05-19  1:17                             ` Kuppuswamy Sathyanarayanan
  2021-05-19  1:20                               ` Sathyanarayanan Kuppuswamy Natarajan
  0 siblings, 1 reply; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-19  1:17 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kuppuswamy Sathyanarayanan

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

KVM hypercalls use the "vmcall" or "vmmcall" instructions.
Although the ABI is similar, those instructions no longer
function for TDX guests. Make vendor-specific TDVMCALLs
instead of VMCALL. This enables TDX guests to run with KVM
acting as the hypervisor. TDX guests running under other
hypervisors will continue to use those hypervisors'
hypercalls.

Since the KVM driver can be built as a kernel module, export
tdx_kvm_hypercall*() to make the symbols visible to kvm.ko.

[Isaku Yamahata: proposed KVM VENDOR string]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/Kconfig                |  5 +++
 arch/x86/include/asm/kvm_para.h | 21 ++++++++++
 arch/x86/include/asm/tdx.h      | 68 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdcall.S        | 26 +++++++++++++
 4 files changed, 120 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9e0e0ff76bab..15e66a99dd41 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,6 +886,11 @@ config INTEL_TDX_GUEST
 	  run in a CPU mode that protects the confidentiality of TD memory
 	  contents and the TD’s CPU state from other software, including VMM.
 
+# This option enables KVM specific hypercalls in TDX guest.
+config INTEL_TDX_GUEST_KVM
+	def_bool y
+	depends on KVM_GUEST && INTEL_TDX_GUEST
+
 endif #HYPERVISOR_GUEST
 
 source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..2fa85481520b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -6,6 +6,7 @@
 #include <asm/alternative.h>
 #include <linux/interrupt.h>
 #include <uapi/asm/kvm_para.h>
+#include <asm/tdx.h>
 
 extern void kvmclock_init(void);
 
@@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 static inline long kvm_hypercall0(unsigned int nr)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall0(nr);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr)
@@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall1(nr, p1);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1)
@@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
 				  unsigned long p2)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall2(nr, p1, p2);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2)
@@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
 				  unsigned long p2, unsigned long p3)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall3(nr, p1, p2, p3);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 				  unsigned long p4)
 {
 	long ret;
+
+	if (is_tdx_guest())
+		return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8ab4067afefc..3d8d977e52f0 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -73,4 +73,72 @@ static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
+			       u64 r15, struct tdx_hypercall_output *out);
+
+/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return __tdx_hypercall_vendor_kvm(nr, 0, 0, 0, 0, NULL);
+}
+
+/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return __tdx_hypercall_vendor_kvm(nr, p1, 0, 0, 0, NULL);
+}
+
+/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+				      unsigned long p2)
+{
+	return __tdx_hypercall_vendor_kvm(nr, p1, p2, 0, 0, NULL);
+}
+
+/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3)
+{
+	return __tdx_hypercall_vendor_kvm(nr, p1, p2, p3, 0, NULL);
+}
+
+/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3,
+				      unsigned long p4)
+{
+	return __tdx_hypercall_vendor_kvm(nr, p1, p2, p3, p4, NULL);
+}
+#else
+static inline long tdx_kvm_hypercall0(unsigned int nr)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
+				      unsigned long p2)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3)
+{
+	return -ENODEV;
+}
+
+static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3,
+				      unsigned long p4)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
+
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index 2dfecdae38bb..27355fb80aeb 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -3,6 +3,7 @@
 #include <asm/asm.h>
 #include <asm/frame.h>
 #include <asm/unwind_hints.h>
+#include <asm/export.h>
 
 #include <linux/linkage.h>
 #include <linux/bits.h>
@@ -25,6 +26,8 @@
 					  TDG_R12 | TDG_R13 | \
 					  TDG_R14 | TDG_R15 )
 
+#define TDVMCALL_VENDOR_KVM		0x4d564b2e584454 /* "TDX.KVM" */
+
 /*
  * TDX guests use the TDCALL instruction to make requests to the
  * TDX module and hypercalls to the VMM. It is supported in
@@ -212,3 +215,26 @@ SYM_FUNC_START(__tdx_hypercall)
 	FRAME_END
 	retq
 SYM_FUNC_END(__tdx_hypercall)
+
+#ifdef CONFIG_INTEL_TDX_GUEST_KVM
+
+/*
+ * Helper function for KVM vendor TDVMCALLs. This assembly wrapper
+ * lets us reuse do_tdx_hypercall() for KVM-specific hypercalls
+ * (TDVMCALL_VENDOR_KVM).
+ */
+SYM_FUNC_START(__tdx_hypercall_vendor_kvm)
+	FRAME_BEGIN
+	/*
+	 * R10 is not part of the function call ABI, but it is part
+	 * of the TDVMCALL ABI. So set it before calling
+	 * do_tdx_hypercall().
+	 */
+	movq $TDVMCALL_VENDOR_KVM, %r10
+	call do_tdx_hypercall
+	FRAME_END
+	retq
+SYM_FUNC_END(__tdx_hypercall_vendor_kvm)
+
+EXPORT_SYMBOL(__tdx_hypercall_vendor_kvm);
+#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
-- 
2.25.1
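
As a side note on the TDVMCALL_VENDOR_KVM value in the patch above: the comment says it spells "TDX.KVM". The sketch below is a user-space check of that claim, assuming the characters are packed least significant byte first. The helper name `vendor_id_to_string()` is hypothetical and not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define TDVMCALL_VENDOR_KVM	0x4d564b2e584454ULL /* value from the patch */

/*
 * Unpack the constant one byte at a time, least significant byte first,
 * into buf (which must hold at least 9 bytes). Stops at the first zero
 * byte or after 8 characters. Returns the string length.
 */
size_t vendor_id_to_string(uint64_t id, char *buf)
{
	size_t n = 0;

	while (id != 0 && n < 8) {
		buf[n++] = (char)(id & 0xff);
		id >>= 8;
	}
	buf[n] = '\0';
	return n;
}
```

Passing TDVMCALL_VENDOR_KVM to this helper yields the 7-character string "TDX.KVM", which matches the comment next to the define.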



* Re: [RFC v2-fix-v3 1/1] x86/tdx: Wire up KVM hypercalls
  2021-05-19  1:17                             ` [RFC v2-fix-v3 " Kuppuswamy Sathyanarayanan
@ 2021-05-19  1:20                               ` Sathyanarayanan Kuppuswamy Natarajan
  0 siblings, 0 replies; 381+ messages in thread
From: Sathyanarayanan Kuppuswamy Natarajan @ 2021-05-19  1:20 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Andi Kleen, Kirill Shutemov, Dan Williams, Raj Ashok,
	Sean Christopherson, Linux Kernel Mailing List

Sorry, I forgot to include a change log.

* Removed tdx-kvm.c and implemented tdx_kvm_hypercall*() functions in tdx.h
* Exported __tdx_hypercall_vendor_kvm() symbol for kvm.ko.
* Fixed commit log as per Dave's suggestion.
* Added Reviewed-by from Dave
* Added FRAME_BEGIN/FRAME_END for __tdx_hypercall_vendor_kvm() to fix
compiler warnings.

On Tue, May 18, 2021 at 6:17 PM Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> KVM hypercalls use the "vmcall" or "vmmcall" instructions.
> Although the ABI is similar, those instructions no longer
> function for TDX guests. Make vendor-specific TDVMCALLs
> instead of VMCALL. This enables TDX guests to run with KVM
> acting as the hypervisor. TDX guests running under other
> hypervisors will continue to use those hypervisors'
> hypercalls.
>
> Since KVM driver can be built as a kernel module, export
> tdx_kvm_hypercall*() to make the symbols visible to kvm.ko.
>
> [Isaku Yamahata: proposed KVM VENDOR string]
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Reviewed-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/Kconfig                |  5 +++
>  arch/x86/include/asm/kvm_para.h | 21 ++++++++++
>  arch/x86/include/asm/tdx.h      | 68 +++++++++++++++++++++++++++++++++
>  arch/x86/kernel/tdcall.S        | 26 +++++++++++++
>  4 files changed, 120 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9e0e0ff76bab..15e66a99dd41 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -886,6 +886,11 @@ config INTEL_TDX_GUEST
>           run in a CPU mode that protects the confidentiality of TD memory
>           contents and the TD’s CPU state from other software, including VMM.
>
> +# This option enables KVM specific hypercalls in TDX guest.
> +config INTEL_TDX_GUEST_KVM
> +       def_bool y
> +       depends on KVM_GUEST && INTEL_TDX_GUEST
> +
>  endif #HYPERVISOR_GUEST
>
>  source "arch/x86/Kconfig.cpu"
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 338119852512..2fa85481520b 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -6,6 +6,7 @@
>  #include <asm/alternative.h>
>  #include <linux/interrupt.h>
>  #include <uapi/asm/kvm_para.h>
> +#include <asm/tdx.h>
>
>  extern void kvmclock_init(void);
>
> @@ -34,6 +35,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
>  static inline long kvm_hypercall0(unsigned int nr)
>  {
>         long ret;
> +
> +       if (is_tdx_guest())
> +               return tdx_kvm_hypercall0(nr);
> +
>         asm volatile(KVM_HYPERCALL
>                      : "=a"(ret)
>                      : "a"(nr)
> @@ -44,6 +49,10 @@ static inline long kvm_hypercall0(unsigned int nr)
>  static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
>  {
>         long ret;
> +
> +       if (is_tdx_guest())
> +               return tdx_kvm_hypercall1(nr, p1);
> +
>         asm volatile(KVM_HYPERCALL
>                      : "=a"(ret)
>                      : "a"(nr), "b"(p1)
> @@ -55,6 +64,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
>                                   unsigned long p2)
>  {
>         long ret;
> +
> +       if (is_tdx_guest())
> +               return tdx_kvm_hypercall2(nr, p1, p2);
> +
>         asm volatile(KVM_HYPERCALL
>                      : "=a"(ret)
>                      : "a"(nr), "b"(p1), "c"(p2)
> @@ -66,6 +79,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
>                                   unsigned long p2, unsigned long p3)
>  {
>         long ret;
> +
> +       if (is_tdx_guest())
> +               return tdx_kvm_hypercall3(nr, p1, p2, p3);
> +
>         asm volatile(KVM_HYPERCALL
>                      : "=a"(ret)
>                      : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
> @@ -78,6 +95,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
>                                   unsigned long p4)
>  {
>         long ret;
> +
> +       if (is_tdx_guest())
> +               return tdx_kvm_hypercall4(nr, p1, p2, p3, p4);
> +
>         asm volatile(KVM_HYPERCALL
>                      : "=a"(ret)
>                      : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 8ab4067afefc..3d8d977e52f0 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -73,4 +73,72 @@ static inline void tdx_early_init(void) { };
>
>  #endif /* CONFIG_INTEL_TDX_GUEST */
>
> +#ifdef CONFIG_INTEL_TDX_GUEST_KVM
> +u64 __tdx_hypercall_vendor_kvm(u64 fn, u64 r12, u64 r13, u64 r14,
> +                              u64 r15, struct tdx_hypercall_output *out);
> +
> +/* Used by kvm_hypercall0() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall0(unsigned int nr)
> +{
> +       return __tdx_hypercall_vendor_kvm(nr, 0, 0, 0, 0, NULL);
> +}
> +
> +/* Used by kvm_hypercall1() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
> +{
> +       return __tdx_hypercall_vendor_kvm(nr, p1, 0, 0, 0, NULL);
> +}
> +
> +/* Used by kvm_hypercall2() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
> +                                     unsigned long p2)
> +{
> +       return __tdx_hypercall_vendor_kvm(nr, p1, p2, 0, 0, NULL);
> +}
> +
> +/* Used by kvm_hypercall3() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
> +                                     unsigned long p2, unsigned long p3)
> +{
> +       return __tdx_hypercall_vendor_kvm(nr, p1, p2, p3, 0, NULL);
> +}
> +
> +/* Used by kvm_hypercall4() to trigger hypercall in TDX guest */
> +static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
> +                                     unsigned long p2, unsigned long p3,
> +                                     unsigned long p4)
> +{
> +       return __tdx_hypercall_vendor_kvm(nr, p1, p2, p3, p4, NULL);
> +}
> +#else
> +static inline long tdx_kvm_hypercall0(unsigned int nr)
> +{
> +       return -ENODEV;
> +}
> +
> +static inline long tdx_kvm_hypercall1(unsigned int nr, unsigned long p1)
> +{
> +       return -ENODEV;
> +}
> +
> +static inline long tdx_kvm_hypercall2(unsigned int nr, unsigned long p1,
> +                                     unsigned long p2)
> +{
> +       return -ENODEV;
> +}
> +
> +static inline long tdx_kvm_hypercall3(unsigned int nr, unsigned long p1,
> +                                     unsigned long p2, unsigned long p3)
> +{
> +       return -ENODEV;
> +}
> +
> +static inline long tdx_kvm_hypercall4(unsigned int nr, unsigned long p1,
> +                                     unsigned long p2, unsigned long p3,
> +                                     unsigned long p4)
> +{
> +       return -ENODEV;
> +}
> +#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
> +
>  #endif /* _ASM_X86_TDX_H */
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> index 2dfecdae38bb..27355fb80aeb 100644
> --- a/arch/x86/kernel/tdcall.S
> +++ b/arch/x86/kernel/tdcall.S
> @@ -3,6 +3,7 @@
>  #include <asm/asm.h>
>  #include <asm/frame.h>
>  #include <asm/unwind_hints.h>
> +#include <asm/export.h>
>
>  #include <linux/linkage.h>
>  #include <linux/bits.h>
> @@ -25,6 +26,8 @@
>                                           TDG_R12 | TDG_R13 | \
>                                           TDG_R14 | TDG_R15 )
>
> +#define TDVMCALL_VENDOR_KVM            0x4d564b2e584454 /* "TDX.KVM" */
> +
>  /*
>   * TDX guests use the TDCALL instruction to make requests to the
>   * TDX module and hypercalls to the VMM. It is supported in
> @@ -212,3 +215,26 @@ SYM_FUNC_START(__tdx_hypercall)
>         FRAME_END
>         retq
>  SYM_FUNC_END(__tdx_hypercall)
> +
> +#ifdef CONFIG_INTEL_TDX_GUEST_KVM
> +
> +/*
> + * Helper function for KVM vendor TDVMCALLs. This assembly wrapper
> + * lets us reuse do_tdvmcall() for KVM-specific hypercalls (
> + * TDVMCALL_VENDOR_KVM).
> + */
> +SYM_FUNC_START(__tdx_hypercall_vendor_kvm)
> +       FRAME_BEGIN
> +       /*
> +        * R10 is not part of the function call ABI, but it is a part
> +        * of the TDVMCALL ABI. So set it before making call to the
> +        * do_tdx_hypercall().
> +        */
> +       movq $TDVMCALL_VENDOR_KVM, %r10
> +       call do_tdx_hypercall
> +       FRAME_END
> +       retq
> +SYM_FUNC_END(__tdx_hypercall_vendor_kvm)
> +
> +EXPORT_SYMBOL(__tdx_hypercall_vendor_kvm);
> +#endif /* CONFIG_INTEL_TDX_GUEST_KVM */
> --
> 2.25.1
>


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK
  2021-04-26 18:01 ` [RFC v2 27/32] x86/tdx: Exclude Shared bit from __PHYSICAL_MASK Kuppuswamy Sathyanarayanan
@ 2021-05-19  5:00   ` Kuppuswamy, Sathyanarayanan
  2021-05-19 16:14   ` Dave Hansen
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-19  5:00 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen, Dan Williams, Tony Luck
  Cc: Andi Kleen, Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Raj Ashok, Sean Christopherson, linux-kernel

Hi Dave,

On 4/26/21 11:01 AM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov"<kirill.shutemov@linux.intel.com>
> 
> tdx_shared_mask() returns the mask that has to be set in a page
> table entry to make page shared with VMM.
> 
> Also, note that we cannot club shared mapping configuration between
> AMD SME and Intel TDX Guest platforms in common function. SME has
> to do it very early in __startup_64() as it sets the bit on all
> memory, except what is used for communication. TDX can postpone as
> we don't need any shared mapping in very early boot.
> 
> Signed-off-by: Kirill A. Shutemov<kirill.shutemov@linux.intel.com>
> Reviewed-by: Andi Kleen<ak@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan<sathyanarayanan.kuppuswamy@linux.intel.com>

Any comments on this patch?

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-04-27 19:20               ` Dave Hansen
  2021-04-28 17:42                 ` [PATCH v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() " Kuppuswamy Sathyanarayanan
@ 2021-05-19  5:58                 ` Kuppuswamy Sathyanarayanan
  2021-05-19  6:04                   ` Kuppuswamy, Sathyanarayanan
  2021-05-19 15:31                   ` Dave Hansen
  1 sibling, 2 replies; 381+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-05-19  5:58 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel, Kuppuswamy Sathyanarayanan

Guests communicate with VMMs using hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
expose guest state to the host.  This prevents the old hypercall
mechanisms from working. So to communicate with the VMM, the TDX
specification defines a new instruction called "tdcall".

In a TDX-based VM, since the VMM is an untrusted entity, an
intermediary layer (the TDX module) exists between host and guest to
facilitate secure communication. The guest uses the "tdcall"
instruction to request services from the TDX module. A variant of
"tdcall" (with specific arguments as defined by the GHCI) is used by
the guest to request services from the VMM via the TDX module.

Implement common helper functions to communicate with the TDX Module
and VMM (using TDCALL instruction).
   
__tdx_hypercall()    - function can be used to request services from
		       the VMM.
__tdx_module_call()  - function can be used to communicate with the
		       TDX Module.

Also define two additional wrappers, tdx_hypercall() and
tdx_hypercall_out_r11() to cover common use cases of
__tdx_hypercall() function. Since each use case of
__tdx_module_call() is different, we don't need such wrappers for it.

Implement __tdx_module_call() and __tdx_hypercall() helper functions
in assembly.

The rationale for choosing assembly over inline assembly is:

1. Since the number of lines of instructions (with comments) in the
__tdx_hypercall() implementation is over 70, using inline assembly
to implement it would make it hard to read.

2. Also, since many registers (R8-R15, R[A-D]X) will be used in the
TDCALL operation, if all these registers are included in inline
assembly constraints, some of the older compilers may not
be able to meet this requirement.

Also, just like syscalls, not all TDVMCALL/TDCALL use cases need to
use the same set of argument registers. The implementation here picks
the current worst-case scenario for TDCALL (4 registers). For TDCALLs
with fewer than 4 arguments, there will end up being a few superfluous
(cheap) instructions.  But, this approach maximizes code reuse. The
same argument applies to __tdx_hypercall() function as well.

The current implementation of __tdx_hypercall() includes error handling
(ud2 on the failure case) in the assembly function instead of doing it
in a C wrapper function. The reason behind this choice is that, when
adding support for in/out instructions (refer to the patch titled
"x86/tdx: Handle port I/O" in this series), we use alternative_io() to
substitute in/out instructions with __tdx_hypercall() calls. Using C
wrappers is not trivial in this case because the input parameters will
be in the wrong registers and it's tricky to include proper buffer code
to make this happen.

For registers used by TDCALL instruction, please check TDX GHCI
specification, sec 2.4 and 3.

https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

Originally-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since RFC v2: 
 * Renamed __tdcall()/__tdvmcall() to __tdx_module_call()/__tdx_hypercall().
 * Renamed reg offsets from TDCALL_rx to TDX_MODULE_rx.
 * Renamed reg offsets from TDVMCALL_rx to TDX_HYPERCALL_rx.
 * Renamed struct tdcall_output to struct tdx_module_output.
 * Renamed struct tdvmcall_output to struct tdx_hypercall_output.
 * Used BIT() to derive TDVMCALL_EXPOSE_REGS_MASK.
 * Removed unnecessary push/pop sequence in __tdcall() function.
 * Fixed comments as per Dave's review.
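
For reference, the BIT()-derived TDVMCALL_EXPOSE_REGS_MASK works out to 0xfc00. The sketch below reproduces the arithmetic in user space. It assumes a local stand-in for the kernel's BIT() macro and assumes, from the TDG_Rn names, that bit n corresponds to register Rn. The helper `tdvmcall_reg_exposed()` is hypothetical and only there to make the mask easy to probe.

```c
#include <stdint.h>

/* User-space stand-in for the kernel's BIT() macro. */
#define BIT(n)		(1ULL << (n))

#define TDG_R10		BIT(10)
#define TDG_R11		BIT(11)
#define TDG_R12		BIT(12)
#define TDG_R13		BIT(13)
#define TDG_R14		BIT(14)
#define TDG_R15		BIT(15)

#define TDVMCALL_EXPOSE_REGS_MASK	(TDG_R10 | TDG_R11 | \
					 TDG_R12 | TDG_R13 | \
					 TDG_R14 | TDG_R15)

/*
 * Hypothetical helper: returns nonzero if register number reg (0-15)
 * has its bit set in the mask.
 */
int tdvmcall_reg_exposed(unsigned int reg)
{
	return reg < 16 && (TDVMCALL_EXPOSE_REGS_MASK & BIT(reg)) != 0;
}
```

Since the value fits in 32 bits, loading it with movl into ECX (as the assembly does) also leaves the upper half of RCX zeroed.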

 arch/x86/include/asm/tdx.h    |  38 ++++++
 arch/x86/kernel/Makefile      |   2 +-
 arch/x86/kernel/asm-offsets.c |  22 ++++
 arch/x86/kernel/tdcall.S      | 222 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c         |  39 ++++++
 5 files changed, 322 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 69af72d08d3d..211b9d66b1b1 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,12 +8,50 @@
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 #include <asm/cpufeature.h>
+#include <linux/types.h>
+
+/*
+ * Used in __tdx_module_call() helper function to gather the
+ * output registers values of TDCALL instruction when requesting
+ * services from the TDX module. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+/*
+ * Used in __tdx_hypercall() helper function to gather the
+ * output registers values of TDCALL instruction when requesting
+ * services from the VMM. This is software only structure
+ * and not related to TDX module/VMM.
+ */
+struct tdx_hypercall_output {
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
 
 /* Common API to check TDX support in decompression and common kernel code. */
 bool is_tdx_guest(void);
 
 void __init tdx_early_init(void);
 
+/* Helper function used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		      struct tdx_module_output *out);
+
+/* Helper function used to request services from VMM */
+u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
+		    struct tdx_hypercall_output *out);
+
 #else // !CONFIG_INTEL_TDX_GUEST
 
 static inline bool is_tdx_guest(void)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ea111bf50691..7966c10ea8d1 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 60b9f42ce3c1..e6b3bb983992 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -23,6 +23,10 @@
 #include <xen/interface/xen.h>
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+#include <asm/tdx.h>
+#endif
+
 #ifdef CONFIG_X86_32
 # include "asm-offsets_32.c"
 #else
@@ -75,6 +79,24 @@ static void __used common(void)
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+	BLANK();
+	/* Offsets for fields in struct tdx_module_output */
+	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
+	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
+	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+	/* Offsets for fields in struct tdx_hypercall_output */
+	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
+	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
+	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
+	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
+	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#endif
+
 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
 	OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..a67c595e4169
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,222 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+
+#define TDG_R10		BIT(10)
+#define TDG_R11		BIT(11)
+#define TDG_R12		BIT(12)
+#define TDG_R13		BIT(13)
+#define TDG_R14		BIT(14)
+#define TDG_R15		BIT(15)
+
+/*
+ * Expose registers R10-R15 to VMM. It is passed via RCX register
+ * to the TDX Module, which will be used by the TDX module to
+ * identify the list of registers exposed to VMM. Each bit in this
+ * mask represents a register ID. You can find the bit field details
+ * in TDX GHCI specification.
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK	( TDG_R10 | TDG_R11 | \
+					  TDG_R12 | TDG_R13 | \
+					  TDG_R14 | TDG_R15 )
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdx_module_call()  - Helper function used by TDX guests to request
+ * services from the TDX module (does not include VMM services).
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with the
+ * TDX module.  And if the "tdcall" operation is successful and a valid
+ * "struct tdx_module_output" pointer is available (in "out" argument),
+ * output from the TDX module is saved to the memory specified in the
+ * "out" pointer. Also the status of the "tdcall" operation is returned
+ * back to the user as a function return value.
+ *
+ * @fn  (RDI)		- TDCALL Leaf ID,    moved to RAX
+ * @rcx (RSI)		- Input parameter 1, moved to RCX
+ * @rdx (RDX)		- Input parameter 2, moved to RDX
+ * @r8  (RCX)		- Input parameter 3, moved to R8
+ * @r9  (R8)		- Input parameter 4, moved to R9
+ *
+ * @out (R9)		- struct tdx_module_output pointer
+ *			  stored temporarily in R12 (not
+ * 			  shared with the TDX module)
+ *
+ * Return status of tdcall via RAX.
+ *
+ * NOTE: This function should not be used for TDX hypercall
+ *       use cases.
+ */
+SYM_FUNC_START(__tdx_module_call)
+	FRAME_BEGIN
+
+	/*
+	 * R12 will be used as temporary storage for
+	 * struct tdx_module_output pointer. You can
+	 * find struct tdx_module_output details in
+	 * arch/x86/include/asm/tdx.h. Also note that
+	 * registers R12-R15 are not used by TDCALL
+	 * services supported by this helper function.
+	 */
+	push %r12	/* Callee saved, so preserve it */
+	mov %r9,  %r12 	/* Move output pointer to R12 */
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	mov %rdi, %rax	/* Move TDCALL Leaf ID to RAX */
+	mov %r8,  %r9	/* Move input 4 to R9 */
+	mov %rcx, %r8	/* Move input 3 to R8 */
+	mov %rsi, %rcx	/* Move input 1 to RCX */
+	/* Leave input param 2 in RDX */
+
+	tdcall
+
+	/* Check for TDCALL success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check for TDCALL output struct != NULL */
+	test %r12, %r12
+	jz 1f
+
+	/* Copy TDCALL result registers to output struct: */
+	movq %rcx, TDX_MODULE_rcx(%r12)
+	movq %rdx, TDX_MODULE_rdx(%r12)
+	movq %r8,  TDX_MODULE_r8(%r12)
+	movq %r9,  TDX_MODULE_r9(%r12)
+	movq %r10, TDX_MODULE_r10(%r12)
+	movq %r11, TDX_MODULE_r11(%r12)
+1:
+	pop %r12 /* Restore the state of R12 register */
+
+	FRAME_END
+	ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * do_tdx_hypercall()  - Helper function used by TDX guests to request
+ * services from the VMM. All requests are made via the TDX module
+ * using "TDCALL" instruction.
+ *
+ * This function contains the code common to vendor-specific
+ * and standard TDX hypercalls. The caller of this function must
+ * set the TDVMCALL type in the R10 register before calling it.
+ *
+ * This function serves as a wrapper to move user call arguments to the
+ * correct registers as specified by "tdcall" ABI and shares it with VMM
+ * via the TDX module. And if the "tdcall" operation is successful and a
+ * valid "struct tdx_hypercall_output" pointer is available (in "out"
+ * argument), output from the VMM is saved to the memory specified in the
+ * "out" pointer. 
+ *
+ * @fn  (RDI)		- TDVMCALL function, moved to R11
+ * @r12 (RSI)		- Input parameter 1, moved to R12
+ * @r13 (RDX)		- Input parameter 2, moved to R13
+ * @r14 (RCX)		- Input parameter 3, moved to R14
+ * @r15 (R8)		- Input parameter 4, moved to R15
+ *
+ * @out (R9)		- struct tdx_hypercall_output pointer
+ *
+ * On successful completion, return TDX hypercall error code.
+ * If the "tdcall" operation fails, panic.
+ *
+ */
+SYM_FUNC_START_LOCAL(do_tdx_hypercall)
+	/* Save non-volatile GPRs that are exposed to the VMM. */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/* Leave hypercall output pointer in R9, it's not clobbered by VMM */
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	xor %eax, %eax /* Move TDCALL leaf ID (TDVMCALL (0)) to RAX */
+	mov %rdi, %r11 /* Move TDVMCALL function id to R11 */
+	mov %rsi, %r12 /* Move input 1 to R12 */
+	mov %rdx, %r13 /* Move input 2 to R13 */
+	mov %rcx, %r14 /* Move input 3 to R14 */
+	mov %r8,  %r15 /* Move input 4 to R15 */
+	/* Caller of do_tdx_hypercall() will set TDVMCALL type in R10 */
+
+	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+	tdcall
+
+	/*
+	 * Check for TDCALL success: 0 - Successful, otherwise failed.
+	 * If failed, there is an issue with TDX Module which is fatal
+	 * for the guest. So panic. Also note that RAX is controlled
+	 * only by the TDX module and not exposed to VMM.
+	 */
+	test %rax, %rax
+	jnz 2f
+
+	/* Move hypercall error code to RAX to return to user */
+	mov %r10, %rax
+
+	/* Check for hypercall success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz 1f
+
+	/* Check for hypercall output struct != NULL */
+	test %r9, %r9
+	jz 1f
+
+	/* Copy hypercall result registers to output struct: */
+	movq %r11, TDX_HYPERCALL_r11(%r9)
+	movq %r12, TDX_HYPERCALL_r12(%r9)
+	movq %r13, TDX_HYPERCALL_r13(%r9)
+	movq %r14, TDX_HYPERCALL_r14(%r9)
+	movq %r15, TDX_HYPERCALL_r15(%r9)
+1:
+	/*
+	 * Zero out registers exposed to the VMM to avoid
+	 * speculative execution with VMM-controlled values.
+	 */
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+	xor %r12d, %r12d
+	xor %r13d, %r13d
+	xor %r14d, %r14d
+	xor %r15d, %r15d
+
+	/* Restore non-volatile GPRs that are exposed to the VMM. */
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	ret
+2:
+	ud2
+SYM_FUNC_END(do_tdx_hypercall)
+
+/*
+ * Helper function for standard type of TDVMCALLs. This assembly
+ * wrapper lets us reuse do_tdx_hypercall() for standard type of
+ * hypercalls (R10 is set to zero).
+ */
+SYM_FUNC_START(__tdx_hypercall)
+	FRAME_BEGIN
+	/*
+	 * R10 is not part of the function call ABI, but it is a part
+	 * of the TDVMCALL ABI. So set it 0 for standard type TDVMCALL
+	 * before making call to the do_tdx_hypercall().
+	 */
+	xor %r10, %r10
+	call do_tdx_hypercall
+	FRAME_END
+	retq
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 6a7193fead08..cbfefc42641e 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -1,8 +1,47 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (C) 2020 Intel Corporation */
 
+#define pr_fmt(fmt) "TDX: " fmt
+
 #include <asm/tdx.h>
 
+/*
+ * Wrapper for simple hypercalls that only return a success/error code.
+ */
+static inline u64 tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	u64 err;
+
+	err = __tdx_hypercall(fn, r12, r13, r14, r15, NULL);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return err;
+}
+
+/*
+ * Wrapper for the semi-common case where we need a single output
+ * value (R11). Callers of this function do not care about the
+ * hypercall error code (mainly for the IN or MMIO use cases).
+ */
+static inline u64 tdx_hypercall_out_r11(u64 fn, u64 r12, u64 r13,
+					u64 r14, u64 r15)
+{
+
+	struct tdx_hypercall_output out = {0};
+	u64 err;
+
+	err = __tdx_hypercall(fn, r12, r13, r14, r15, &out);
+
+	if (err)
+		pr_warn_ratelimited("TDVMCALL fn:%llx failed with err:%llx\n",
+				    fn, err);
+
+	return out.r11;
+}
+
 static inline bool cpuid_has_tdx_guest(void)
 {
 	u32 eax, signature[3];
-- 
2.25.1
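
For readers following the register shuffling in do_tdx_hypercall() above,
here is a rough C model of the same control flow. It is only a sketch and
not part of the patch: every model_* name is hypothetical, the TDCALL
instruction and the VMM are replaced by a function pointer, and the ud2
path is modelled as a "fatal" flag. The register zeroing done against
speculative use of VMM-controlled values has no equivalent here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit layout as TDVMCALL_EXPOSE_REGS_MASK in the patch: R10-R15. */
#define MODEL_BIT(n)           (1ULL << (n))
#define MODEL_EXPOSE_REGS_MASK (MODEL_BIT(10) | MODEL_BIT(11) | MODEL_BIT(12) | \
				MODEL_BIT(13) | MODEL_BIT(14) | MODEL_BIT(15))

/* Hypothetical register file; only the registers the helper touches. */
struct model_regs {
	uint64_t rax, rcx, r10, r11, r12, r13, r14, r15;
};

/* Same fields as struct tdx_hypercall_output in the patch. */
struct model_hypercall_output {
	uint64_t r11, r12, r13, r14, r15;
};

/* Hypothetical stand-in for TDCALL plus the VMM behind it. */
typedef void (*model_tdcall_fn)(struct model_regs *regs);

/*
 * C model of do_tdx_hypercall(). Returns the hypercall error code (R10).
 * *fatal is set to 1 where the assembly would execute ud2.
 */
static uint64_t model_tdx_hypercall(model_tdcall_fn tdcall, uint64_t type,
				    uint64_t fn, uint64_t r12, uint64_t r13,
				    uint64_t r14, uint64_t r15,
				    struct model_hypercall_output *out,
				    int *fatal)
{
	struct model_regs regs = {
		.rax = 0,			/* TDCALL leaf 0: TDVMCALL */
		.rcx = MODEL_EXPOSE_REGS_MASK,	/* registers shared with VMM */
		.r10 = type,			/* 0 for standard TDVMCALLs */
		.r11 = fn,
		.r12 = r12, .r13 = r13, .r14 = r14, .r15 = r15,
	};
	uint64_t err;

	assert(tdcall != NULL && fatal != NULL);
	*fatal = 0;

	tdcall(&regs);

	/* Non-zero RAX: TDX module failure, fatal for the guest. */
	if (regs.rax != 0) {
		*fatal = 1;
		return regs.rax;
	}

	/* R10 carries the hypercall error code reported by the VMM. */
	err = regs.r10;

	/* Outputs are copied only on success and only if requested. */
	if (err == 0 && out != NULL) {
		out->r11 = regs.r11;
		out->r12 = regs.r12;
		out->r13 = regs.r13;
		out->r14 = regs.r14;
		out->r15 = regs.r15;
	}

	return err;
}

/* Sample stub: succeeds and returns R12 + R13 in R11. */
static void model_vmm_add(struct model_regs *regs)
{
	regs->rax = 0;
	regs->r10 = 0;
	regs->r11 = regs->r12 + regs->r13;
}

/* Sample stub: TDCALL itself succeeds, but the VMM reports error 1. */
static void model_vmm_error(struct model_regs *regs)
{
	regs->rax = 0;
	regs->r10 = 1;
	regs->r11 = 0xdead;
}
```

With model_vmm_add as the stub, a call with r12 = 2 and r13 = 3 leaves 5
in out.r11 and returns 0. With model_vmm_error the output struct is left
untouched and 1 is returned, matching the "jnz 1f" path in the assembly.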



* Re: [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-05-19  5:58                 ` [RFC v2-fix-v1 " Kuppuswamy Sathyanarayanan
@ 2021-05-19  6:04                   ` Kuppuswamy, Sathyanarayanan
  2021-05-19 15:31                   ` Dave Hansen
  1 sibling, 0 replies; 381+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2021-05-19  6:04 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski, Dave Hansen
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

Hi Dave,

On 5/18/21 10:58 PM, Kuppuswamy Sathyanarayanan wrote:
> Guests communicate with VMMs with hypercalls. Historically, these
> are implemented using instructions that are known to cause VMEXITs
> like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
> expose guest state to the host.  This prevents the old hypercall
> mechanisms from working. So to communicate with VMM, TDX
> specification defines a new instruction called "tdcall".
> 
> In TDX based VM, since VMM is an untrusted entity, a intermediary
> layer (TDX module) exists between host and guest to facilitate the
> secure communication. And "tdcall" instruction  is used by the guest
> to request services from TDX module. And a variant of "tdcall"
> instruction (with specific arguments as defined by GHCI) is used by
> the guest to request services from  VMM via the TDX module.
> 
> Implement common helper functions to communicate with the TDX Module
> and VMM (using TDCALL instruction).
>     
> __tdx_hypercall()    - function can be used to request services from
> 		       the VMM.
> __tdx_module_call()  - function can be used to communicate with the
> 		       TDX Module.
> 
> Also define two additional wrappers, tdx_hypercall() and
> tdx_hypercall_out_r11() to cover common use cases of
> __tdx_hypercall() function. Since each use case of
> __tdx_module_call() is different, we don't need such wrappers for it.
> 
> Implement __tdx_module_call() and __tdx_hypercall() helper functions
> in assembly.
> 
> Rationale behind choosing to use assembly over inline assembly are,
> 
> 1. Since the number of lines of instructions (with comments) in
> __tdx_hypercall() implementation is over 70, using inline assembly
> to implement it will make it hard to read.
>     
> 2. Also, since many registers (R8-R15, R[A-D]X)) will be used in
> TDCALL operation, if all these registers are included in in-line
> assembly constraints, some of the older compilers may not
> be able to meet this requirement.
> 
> Also, just like syscalls, not all TDVMCALL/TDCALLs use cases need to
> use the same set of argument registers. The implementation here picks
> the current worst-case scenario for TDCALL (4 registers). For TDCALLs
> with fewer than 4 arguments, there will end up being a few superfluous
> (cheap) instructions.  But, this approach maximizes code reuse. The
> same argument applies to __tdx_hypercall() function as well.
> 
> Current implementation of __tdx_hypercall() includes error handling
> (ud2 on failure case) in assembly function instead of doing it in C
> wrapper function. The reason behind this choice is, when adding support
> for in/out instructions (refer to patch titled "x86/tdx: Handle port
> I/O" in this series), we use alternative_io() to substitute in/out
> instruction with  __tdx_hypercall() calls. So use of C wrappers is not
> trivial in this case because the input parameters will be in the wrong
> registers and it's tricky to include proper buffer code to make this
> happen.
> 
> For registers used by TDCALL instruction, please check TDX GHCI
> specification, sec 2.4 and 3.
> 
> https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
> 
> Originally-by: Sean Christopherson<seanjc@google.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan<sathyanarayanan.kuppuswamy@linux.intel.com>

I sent it with In-Reply-To set to message ID 3a7c0bba-cc43-e4ba-f7fe-43c8627c2fc2@intel.com (your
last reply), but for some reason it's not detected as a reply to the original patch
"[RFC v2 05/32] x86/tdx: Add __tdcall() and __tdvmcall() helper functions".

I am not sure what's going on, but please review it as a reply to the original patch.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [RFC v2-fix-v1 1/1] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2021-05-19  5:58                 ` [RFC v2-fix-v1 " Kuppuswamy Sathyanarayanan
  2021-05-19  6:04                   ` Kuppuswamy, Sathyanarayanan
@ 2021-05-19 15:31                   ` Dave Hansen
  2021-05-19 19:09                     ` [RFC v2-fix-v2 " Kuppuswamy Sathyanarayanan
  2021-05-19 19:13                     ` [RFC v2-fix-v1 " Kuppuswamy, Sathyanarayanan
  1 sibling, 2 replies; 381+ messages in thread
From: Dave Hansen @ 2021-05-19 15:31 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, Dan Williams, Raj Ashok,
	Sean Christopherson, linux-kernel

On 5/18/21 10:58 PM, Kuppuswamy Sathyanarayanan wrote:
> Guests communicate with VMMs with hypercalls. Historically, these
> are implemented using instructions that are known to cause VMEXITs
> like vmcall, vmlaunch, etc. However, with TDX, VMEXITs no longer
> expose guest state to the host.  This prevents the old hypercall
> mechanisms from working. So to communicate with VMM, TDX
> specification defines a new instruction called "tdcall".
> 
> In TDX based VM, since VMM is an untrusted entity, a intermediary

"In a TDX-based VM..."

> layer (TDX module) exists between host and guest to facilitate the
> secure communication. And "tdcall" instruction  is used by the guest
> to request services from TDX module. And a variant of "tdcall"
> instruction (with specific arguments as defined by GHCI) is used by
> the guest to request services from  VMM via the TDX module.

I'd just say:

	TDX guests communicate with the TDX module and with the VMM
	using a new instruction: TDCALL.

The rest of that is noise.

> Implement common helper functions to communicate with the TDX Module
> and VMM (using TDCALL instruction).
>    
> __tdx_hypercall()    - function can be used to request services from
> 		       the VMM.
> __tdx_module_call()  - function can be used to communicate with the
> 		       TDX Module.

s/function can be used to//

> Also define two additional wrappers, tdx_hypercall() and
> tdx_hypercall_out_r11() to cover common use cases of
> __tdx_hypercall() function. Since each use case of
> __tdx_module_call() is different, we don't need such wrappers for it.
> 
> Implement __tdx_module_call() and __tdx_hypercall() helper functions
> in assembly.
> 
> Rationale behind choosing to use assembly over inline assembly are,
> 
> 1. Since the number of lines of instructions (with comments) in
> __tdx_hypercall() implementation is over 70, using inline assembly
> to implement it will make it hard to read.
>    
> 2. Also, since many registers (R8-R15, R[A-D]X)) will be used in
> TDCALL operation, if all these registers are included in in-line
> assembly constraints, some of the older compilers may not
> be able to meet this requirement.

Was this "older compiler" argument really the reason?

> Also, just like syscalls, not all TDVMCALL/TDCALLs use cases need to
> use the same set of argument registers. The implementation here picks
> the current worst-case scenario for TDCALL (4 registers). For TDCALLs
> with fewer than 4 arguments, there will end up being a few superfluous
> (cheap) instructions.  But, this approach maximizes code reuse. The
> same argument applies to __tdx_hypercall() function as well.
> 
> Current implementation of __tdx_hypercall() includes error handling
> (ud2 on failure case) in assembly function instead of doing it in C
> wrapper function. The reason behind this choice is, when adding support
> for in/out instructions (refer to patch titled "x86/tdx: Handle port
> I/O" in this series), we use alternative_io() to substitute in/out
> instruction with  __tdx_hypercall() calls. So use of C wrappers is not
> trivial in this case because the input parameters will be in the wrong
> registers and it's tricky to include proper buffer code to make this
> happen.
> 
> For registers used by TDCALL instruction, please check TDX GHCI
> specification, sec 2.4 and 3.
> 
> https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
> 
> Originally-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

For what it's worth, that changelog really starts to ramble after the
"rationale" part.

> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 69af72d08d3d..211b9d66b1b1 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -8,12 +8,50 @@
>  #ifdef CONFIG_INTEL_TDX_GUEST
>  
>  #include <asm/cpufeature.h>
> +#include <linux/types.h>
> +
> +/*
> + * Used in __tdx_module_call() helper function to gather the
> + * output registers values of TDCALL instruction when requesting

There's something wrong in this sentence.  This needs to be "output
register values" or "output registers' values".

> + * services from the TDX module. This is software only structure
> + * and not related to TDX module/VMM.
> + */
> +struct tdx_module_output {
> +	u64 rcx;
> +	u64 rdx;
> +	u64 r8;
> +	u64 r9;
> +	u64 r10;
> +	u64 r11;
> +};
> +
> +/*
> + * Used in __tdx_hypercall() helper function to gather the
> + * output registers values of TDCALL instruction when requesting
> + * services from the VMM. This is software only structure
> + * and not related to TDX module/VMM.
> + */
> +struct tdx_hypercall_output {
> +	u64 r11;
> +	u64 r12;
> +	u64 r13;
> +	u64 r14;
> +	u64 r15;
> +};
>  
>  /* Common API to check TDX support in decompression and common kernel code. */
>  bool is_tdx_guest(void);
>  
>  void __init tdx_early_init(void);
>  
> +/* Helper function used to communicate with the TDX module */
> +u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> +		      struct tdx_module_output *out);
> +
> +/* Helper function used to request services from VMM */
> +u64 __tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
> +		    struct tdx_hypercall_output *out);
> +
>  #else // !CONFIG_INTEL_TDX_GUEST
>  
>  static inline bool is_tdx_guest(void)
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index ea111bf50691..7966c10ea8d1 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -127,7 +127,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
>  obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
>  
>  obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
> -obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
> +obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
>  
>  obj-$(CONFIG_EISA)		+= eisa.o
>  obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> index 60b9f42ce3c1..e6b3bb983992 100644
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -23,6 +23,10 @@
>  #include <xen/interface/xen.h>
>  #endif
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +#include <asm/tdx.h>
> +#endif
> +
>  #ifdef CONFIG_X86_32
>  # include "asm-offsets_32.c"
>  #else
> @@ -75,6 +79,24 @@ static void __used common(void)
>  	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
>  #endif
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +	BLANK();
> +	/* Offset for fields in tdcall_output */
> +	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
> +	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
> +	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
> +	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
> +	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
> +	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
> +
> +	/* Offset for fields in tdvmcall_output */
> +	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
> +	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
> +	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
> +	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
> +	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
> +#endif
> +
>  	BLANK();
>  	OFFSET(BP_scratch, boot_params, scratch);
>  	OFFSET(BP_secure_boot, boot_params, secure_boot);
> diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
> new file mode 100644
> index 000000000000..a67c595e4169
> --- /dev/null
> +++ b/arch/x86/kernel/tdcall.S
> @@ -0,0 +1,222 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <asm/asm-offsets.h>
> +#include <asm/asm.h>
> +#include <asm/frame.h>
> +#include <asm/unwind_hints.h>
> +
> +#include <linux/linkage.h>
> +#include <linux/bits.h>
> +
> +#define TDG_R10		BIT(10)
> +#define TDG_R11		BIT(11)
> +#define TDG_R12		BIT(12)
> +#define TDG_R13		BIT(13)
> +#define TDG_R14		BIT(14)
> +#define TDG_R15		BIT(15)
> +
> +/*
> + * Expose registers R10-R15 to VMM. It is passed via RCX register
> + * to the TDX Module, which will be used by the TDX module to
> + * identify the list of registers exposed to VMM. Each bit in this
> + * mask represents a register ID. You can find the bit field details
> + * in TDX GHCI specification.
> + */
> +#define TDVMCALL_EXPOSE_REGS_MASK	( TDG_R10 | TDG_R11 | \
> +					  TDG_R12 | TDG_R13 | \
> +					  TDG_R14 | TDG_R15 )
> +
> +/*
> + * TDX guests use the TDCALL instruction to make requests to the
> + * TDX module and hypercalls to the VMM. It is supported in
> + * Binutils >= 2.36.
> + */
> +#define tdcall .byte 0x66,0x0f,0x01,0xcc
> +
> +/*
> + * __tdx_module_call()  - Helper function used by TDX guests to request
> + * services from the TDX module (does not include VMM services).
> + *
> + * This function serves as a wrapper to move user call arguments to the
> + * correct registers as specified by "tdcall" ABI and shares it with the
> + * TDX module.  And if the "tdcall" operation is successful and a valid

It's frequently taught to never start a sentence with "And" in formal
writing.  You use it fairly frequently.  Simply removing it increases
readability, IMNHO.

> + * "struct tdx_module_output" pointer is available (in "out" argument),
> + * output from the TDX module is saved to the memory specified in the
> + * "out" pointer. Also the status of the "tdcall" operation is returned
> + * back to the user as a function return value.
> + *
> + * @fn  (RDI)		- TDCALL Leaf ID,    moved to RAX
> + * @rcx (RSI)		- Input parameter 1, moved to RCX
> + * @rdx (RDX)		- Input parameter 2, moved to RDX
> + * @r8  (RCX)		- Input parameter 3