* [PATCHv5 00/30] TDX Guest: TDX core support
@ 2022-03-02 14:27 Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 01/30] x86/tdx: Detect running as a TDX guest in early boot Kirill A. Shutemov
                   ` (29 more replies)
  0 siblings, 30 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov

Hi All,

Intel's Trust Domain Extensions (TDX) protects confidential guest VMs
from the host and from physical attacks by isolating the guest register
state and by encrypting the guest memory. In TDX, a special TDX module
sits between the host and the guest; it runs in a special mode and
manages the guest/host separation.

	Please review and consider applying.

More details of TDX guests can be found in Documentation/x86/tdx.rst.

All dependencies of the patchset are in Linus' tree now.

SEV/TDX comparison:
-------------------

TDX has a lot of similarities to SEV. It enhances confidentiality
of guest memory and state (like registers) and includes a new exception
(#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
yet), TDX limits the host's ability to make changes in the guest
physical address space.

TDX/VM comparison:
------------------

Some of the key differences between a TD and a regular VM are:

1. Multi-CPU bring-up is done using the ACPI MADT wake-up table.
2. A new #VE exception handler is added. The TDX module injects a #VE
   exception into the guest TD for instructions that need to be emulated,
   disallowed MSR accesses, etc.
3. By default, memory is marked as private, and the TD selectively shares
   it with the VMM as needed (see the sketch below).
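
For illustration only, here is a minimal sketch of point 3, assuming the
generic set_memory_{decrypted,encrypted}() API that later patches in this
series wire up for TDX; alloc_vmm_shared_page() is a hypothetical helper,
not part of the patchset:

	#include <linux/set_memory.h>
	#include <linux/gfp.h>

	/* Hypothetical: allocate a page and share it with the VMM. */
	static void *alloc_vmm_shared_page(void)
	{
		void *addr = (void *)__get_free_page(GFP_KERNEL);

		if (!addr)
			return NULL;

		/*
		 * Convert the page from private to shared: this sets the
		 * shared bit in the page table entry and notifies the VMM.
		 */
		if (set_memory_decrypted((unsigned long)addr, 1)) {
			free_page((unsigned long)addr);
			return NULL;
		}

		return addr;
	}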

You can find TDX-related documents at the following link.

https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html

Git tree:

https://github.com/intel/tdx.git guest-upstream

Previous version:

https://lore.kernel.org/r/20220224155630.52734-1-kirill.shutemov@linux.intel.com

Changes from v4:
  - Update comments for TDX_MODULE_CALL()
  - Clarify how TDX_SEAMCALL_VMFAILINVALID is defined
  - Updated comments in __tdx_hypercall()
  - Get rid of td_info
  - Move exc_general_protection() refactoring into a separate patch
  - Updates comments around #VE handling
  - Add hcall_func() to differentiate exit reasons from hypercalls
  - Only allow hypervisor CPUID leaves to be handled with #VE
  - Update MMIO handling comments and commit message
  - Update commit messages for port I/O related patches
  - Rename init_io_ops() to init_default_io_ops()
  - Refactor handle_io()
  - Fold the warning fix from a standalone patch into the patch that makes
    the warning triggerable
  - Do not flush cache on entering sleep state for any virtual machine, not only TDX
  - Documentation is updated
Changes from v3:
  - Rebased on top of merged x86/coco patches
  - Sanity build-time check for TDX detection (Cyrill Gorcunov)
  - Correction in the documentation regarding #VE for CPUID
Changes from v2:
  - Move TDX-Guest-specific code under arch/x86/coco/
  - Code shared between host and guest is under arch/x86/virt/
  - Fix handling CR4.MCE for !CONFIG_X86_MCE
  - A separate patch to clarify CR0.NE situation
  - Use u8/u16/u32 for port I/O handler
  - Rework TDCALL helpers:
    + consolidation between guest and host
    + clearer interface
    + A new tdx_module_call() that panics if TDCALL fails
  - Rework MMIO handling to improve readability
  - New generic API to deal with encryption masks
  - Move tdx_early_init() before copy_bootdata() (again)
  - Rework #VE handing to share more code with #GP handler
  - Rework __set_memory_enc_pgtable() to provide proper abstraction for both
    SME/SEV and TDX cases.
  - Fix warning on build with X86_MEM_ENCRYPT=y
  - ... and more
Changes from v1:
  - Rebased to tip/master (94985da003a4).
  - Address feedback from Borislav and Josh.
  - Wire up KVM hypercalls. Needed to send IPI.
Andi Kleen (1):
  x86/tdx: Port I/O: add early boot support

Isaku Yamahata (1):
  x86/tdx: ioapic: Add shared bit for IOAPIC base address

Kirill A. Shutemov (18):
  x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers
  x86/tdx: Extend the confidential computing API to support TDX guests
  x86/tdx: Exclude shared bit from __PHYSICAL_MASK
  x86/traps: Refactor exc_general_protection()
  x86/traps: Add #VE support for TDX guest
  x86/tdx: Add HLT support for TDX guests
  x86/tdx: Add MSR support for TDX guests
  x86/tdx: Handle CPUID via #VE
  x86/tdx: Handle in-kernel MMIO
  x86: Adjust types used in port I/O helpers
  x86: Consolidate port I/O helpers
  x86/boot: Port I/O: allow to hook up alternative helpers
  x86/boot: Port I/O: add decompression-time support for TDX
  x86/boot: Set CR0.NE early and keep it set during the boot
  x86/tdx: Make pages shared in ioremap()
  x86/mm/cpa: Add support for TDX shared memory
  x86/kvm: Use bounce buffers for TD guest
  ACPICA: Avoid cache flush inside virtual machines

Kuppuswamy Sathyanarayanan (8):
  x86/tdx: Detect running as a TDX guest in early boot
  x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper
    functions
  x86/tdx: Detect TDX at early kernel decompression time
  x86/tdx: Port I/O: add runtime hypercalls
  x86/tdx: Wire up KVM hypercalls
  x86/acpi, x86/boot: Add multiprocessor wake-up support
  x86/topology: Disable CPU online/offline control for TDX guests
  Documentation/x86: Document TDX kernel architecture

Sean Christopherson (2):
  x86/boot: Add a trampoline for booting APs via firmware handoff
  x86/boot: Avoid #VE during boot for TDX platforms

 Documentation/x86/index.rst              |   1 +
 Documentation/x86/tdx.rst                | 214 ++++++++
 arch/x86/Kconfig                         |  15 +
 arch/x86/boot/a20.c                      |  14 +-
 arch/x86/boot/boot.h                     |  35 +-
 arch/x86/boot/compressed/Makefile        |   1 +
 arch/x86/boot/compressed/head_64.S       |  27 +-
 arch/x86/boot/compressed/misc.c          |  26 +-
 arch/x86/boot/compressed/misc.h          |   4 +-
 arch/x86/boot/compressed/pgtable.h       |   2 +-
 arch/x86/boot/compressed/tdcall.S        |   3 +
 arch/x86/boot/compressed/tdx.c           |  99 ++++
 arch/x86/boot/compressed/tdx.h           |  15 +
 arch/x86/boot/cpuflags.c                 |   3 +-
 arch/x86/boot/cpuflags.h                 |   1 +
 arch/x86/boot/early_serial_console.c     |  28 +-
 arch/x86/boot/io.h                       |  32 ++
 arch/x86/boot/main.c                     |   4 +
 arch/x86/boot/pm.c                       |  10 +-
 arch/x86/boot/tty.c                      |   4 +-
 arch/x86/boot/video-vga.c                |   6 +-
 arch/x86/boot/video.h                    |   8 +-
 arch/x86/coco/Makefile                   |   2 +
 arch/x86/coco/core.c                     |  14 +-
 arch/x86/coco/tdcall.S                   | 201 +++++++
 arch/x86/coco/tdx.c                      | 634 +++++++++++++++++++++++
 arch/x86/include/asm/acenv.h             |  14 +-
 arch/x86/include/asm/apic.h              |   7 +
 arch/x86/include/asm/cpufeatures.h       |   1 +
 arch/x86/include/asm/disabled-features.h |   8 +-
 arch/x86/include/asm/idtentry.h          |   4 +
 arch/x86/include/asm/io.h                |  42 +-
 arch/x86/include/asm/kvm_para.h          |  22 +
 arch/x86/include/asm/mem_encrypt.h       |   6 +-
 arch/x86/include/asm/realmode.h          |   1 +
 arch/x86/include/asm/shared/io.h         |  34 ++
 arch/x86/include/asm/shared/tdx.h        |  37 ++
 arch/x86/include/asm/tdx.h               |  90 ++++
 arch/x86/kernel/acpi/boot.c              | 118 +++++
 arch/x86/kernel/apic/apic.c              |  10 +
 arch/x86/kernel/apic/io_apic.c           |  15 +-
 arch/x86/kernel/asm-offsets.c            |  19 +
 arch/x86/kernel/head64.c                 |   7 +
 arch/x86/kernel/head_64.S                |  28 +-
 arch/x86/kernel/idt.c                    |   3 +
 arch/x86/kernel/process.c                |   4 +
 arch/x86/kernel/smpboot.c                |  12 +-
 arch/x86/kernel/traps.c                  | 138 ++++-
 arch/x86/mm/ioremap.c                    |   5 +
 arch/x86/mm/mem_encrypt.c                |   9 +-
 arch/x86/realmode/rm/header.S            |   1 +
 arch/x86/realmode/rm/trampoline_64.S     |  57 +-
 arch/x86/realmode/rm/trampoline_common.S |  12 +-
 arch/x86/realmode/rm/wakemain.c          |  14 +-
 arch/x86/virt/tdxcall.S                  |  95 ++++
 include/linux/cc_platform.h              |  10 +
 kernel/cpu.c                             |   7 +
 57 files changed, 2074 insertions(+), 159 deletions(-)
 create mode 100644 Documentation/x86/tdx.rst
 create mode 100644 arch/x86/boot/compressed/tdcall.S
 create mode 100644 arch/x86/boot/compressed/tdx.c
 create mode 100644 arch/x86/boot/compressed/tdx.h
 create mode 100644 arch/x86/boot/io.h
 create mode 100644 arch/x86/coco/tdcall.S
 create mode 100644 arch/x86/coco/tdx.c
 create mode 100644 arch/x86/include/asm/shared/io.h
 create mode 100644 arch/x86/include/asm/shared/tdx.h
 create mode 100644 arch/x86/include/asm/tdx.h
 create mode 100644 arch/x86/virt/tdxcall.S

-- 
2.34.1



* [PATCHv5 01/30] x86/tdx: Detect running as a TDX guest in early boot
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-04 15:43   ` Borislav Petkov
  2022-03-02 14:27 ` [PATCHv5 02/30] x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers Kirill A. Shutemov
                   ` (28 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A . Shutemov,
	Dave Hansen

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

In preparation for extending the cc_platform_has() API to support TDX
guests, use the CPUID instruction to detect support for TDX guests in
the early boot code (via tdx_early_init()). Since copy_bootdata() is the
first user of the cc_platform_has() API, detect the TDX guest status
before it runs.

Define a synthetic feature flag (X86_FEATURE_TDX_GUEST) and set this
bit on a valid TDX guest platform.
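
For reference, a minimal sketch of how other kernel code can now test for
TDX (cpu_feature_enabled() is the existing cpufeature API; report_tdx() is
a hypothetical caller). Thanks to the DISABLE_TDX_GUEST entry below, such
checks compile away when CONFIG_INTEL_TDX_GUEST=n:

	#include <asm/cpufeature.h>

	/* Hypothetical caller, for illustration only. */
	static void report_tdx(void)
	{
		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
			pr_info("running as a TDX guest\n");
	}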

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/Kconfig                         | 12 ++++++++++++
 arch/x86/coco/Makefile                   |  2 ++
 arch/x86/coco/tdx.c                      | 23 +++++++++++++++++++++++
 arch/x86/include/asm/cpufeatures.h       |  1 +
 arch/x86/include/asm/disabled-features.h |  8 +++++++-
 arch/x86/include/asm/tdx.h               | 21 +++++++++++++++++++++
 arch/x86/kernel/head64.c                 |  4 ++++
 7 files changed, 70 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/coco/tdx.c
 create mode 100644 arch/x86/include/asm/tdx.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 57a4e0285a80..c346d66b51fc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -880,6 +880,18 @@ config ACRN_GUEST
 	  IOT with small footprint and real-time features. More details can be
 	  found in https://projectacrn.org/.
 
+config INTEL_TDX_GUEST
+	bool "Intel TDX (Trust Domain Extensions) - Guest Support"
+	depends on X86_64 && CPU_SUP_INTEL
+	depends on X86_X2APIC
+	help
+	  Support running as a guest under Intel TDX.  Without this support,
+	  the guest kernel can not boot or run under TDX.
+	  TDX includes memory encryption and integrity capabilities
+	  which protect the confidentiality and integrity of guest
+	  memory contents and CPU state. TDX guests are protected from
+	  some attacks from the VMM.
+
 endif #HYPERVISOR_GUEST
 
 source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/coco/Makefile b/arch/x86/coco/Makefile
index c1ead00017a7..32f4c6e6f199 100644
--- a/arch/x86/coco/Makefile
+++ b/arch/x86/coco/Makefile
@@ -4,3 +4,5 @@ KASAN_SANITIZE_core.o	:= n
 CFLAGS_core.o		+= -fno-stack-protector
 
 obj-y += core.o
+
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
new file mode 100644
index 000000000000..00898e3eb77f
--- /dev/null
+++ b/arch/x86/coco/tdx.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2021-2022 Intel Corporation */
+
+#undef pr_fmt
+#define pr_fmt(fmt)     "tdx: " fmt
+
+#include <linux/cpufeature.h>
+#include <asm/tdx.h>
+
+void __init tdx_early_init(void)
+{
+	u32 eax, sig[3];
+
+	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
+
+	BUILD_BUG_ON(sizeof(sig) != sizeof(TDX_IDENT) - 1);
+	if (memcmp(TDX_IDENT, sig, sizeof(sig)))
+		return;
+
+	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+	pr_info("Guest detected\n");
+}
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 5cd22090e53d..cacc8dde854b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -238,6 +238,7 @@
 #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
 #define X86_FEATURE_PVUNLOCK		( 8*32+20) /* "" PV unlock function */
 #define X86_FEATURE_VCPUPREEMPT		( 8*32+21) /* "" PV vcpu_is_preempted function */
+#define X86_FEATURE_TDX_GUEST		( 8*32+22) /* Intel Trust Domain Extensions Guest */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 1231d63f836d..b37de8268c9a 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -68,6 +68,12 @@
 # define DISABLE_SGX	(1 << (X86_FEATURE_SGX & 31))
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+# define DISABLE_TDX_GUEST	0
+#else
+# define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -79,7 +85,7 @@
 #define DISABLED_MASK5	0
 #define DISABLED_MASK6	0
 #define DISABLED_MASK7	(DISABLE_PTI)
-#define DISABLED_MASK8	0
+#define DISABLED_MASK8	(DISABLE_TDX_GUEST)
 #define DISABLED_MASK9	(DISABLE_SMAP|DISABLE_SGX)
 #define DISABLED_MASK10	0
 #define DISABLED_MASK11	0
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
new file mode 100644
index 000000000000..ba8042ce61c2
--- /dev/null
+++ b/arch/x86/include/asm/tdx.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2021-2022 Intel Corporation */
+#ifndef _ASM_X86_TDX_H
+#define _ASM_X86_TDX_H
+
+#include <linux/init.h>
+
+#define TDX_CPUID_LEAF_ID	0x21
+#define TDX_IDENT		"IntelTDX    "
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+void __init tdx_early_init(void);
+
+#else
+
+static inline void tdx_early_init(void) { };
+
+#endif /* CONFIG_INTEL_TDX_GUEST */
+
+#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 4f5ecbbaae77..6dff50c3edd6 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -40,6 +40,7 @@
 #include <asm/extable.h>
 #include <asm/trapnr.h>
 #include <asm/sev.h>
+#include <asm/tdx.h>
 
 /*
  * Manage page tables very early on.
@@ -514,6 +515,9 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	idt_setup_early_handler();
 
+	/* Needed before cc_platform_has() can be used for TDX */
+	tdx_early_init();
+
 	copy_bootdata(__va(real_mode_data));
 
 	/*
-- 
2.34.1



* [PATCHv5 02/30] x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 01/30] x86/tdx: Detect running as a TDX guest in early boot Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-08 19:56   ` Dave Hansen
  2022-03-10 12:32   ` Borislav Petkov
  2022-03-02 14:27 ` [PATCHv5 03/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions Kirill A. Shutemov
                   ` (27 subsequent siblings)
  29 siblings, 2 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov

Secure Arbitration Mode (SEAM) is an extension of VMX architecture.  It
defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
operation (SEAM VMX non-root) which are both isolated from the legacy
VMX operation where the host kernel runs.

A CPU-attested software module (called 'TDX module') runs in SEAM VMX
root to manage and protect VMs running in SEAM VMX non-root.  SEAM VMX
root is also used to host another CPU-attested software module (called
'P-SEAMLDR') to load and update the TDX module.

The host kernel transits to either the P-SEAMLDR or the TDX module via
the new SEAMCALL instruction, which is essentially a VMExit from VMX
root mode to SEAM VMX root mode.  SEAMCALLs are leaf functions defined
by the P-SEAMLDR and the TDX module around the new SEAMCALL instruction.

A guest kernel can also communicate with the TDX module via the TDCALL
instruction.

TDCALLs and SEAMCALLs use an ABI different from the x86-64 System V ABI.
RAX is used to carry both the SEAMCALL leaf function number (input) and
the completion status (output).  Additional GPRs (RCX, RDX, R8-R11) may
be further used as both input and output operands in individual leaf
functions.

TDCALL and SEAMCALL share the same ABI and require largely the same
code to pass down arguments and retrieve results.

Define an assembly macro that can be used to implement C wrappers for
both TDCALL and SEAMCALL.
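
For orientation, the C prototype this macro ends up backing (added by the
next patch in this series) looks like:

	u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
			      struct tdx_module_output *out);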

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/tdx.h    | 28 +++++++++++
 arch/x86/kernel/asm-offsets.c |  9 ++++
 arch/x86/virt/tdxcall.S       | 95 +++++++++++++++++++++++++++++++++++
 3 files changed, 132 insertions(+)
 create mode 100644 arch/x86/virt/tdxcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index ba8042ce61c2..e5ff8ed59adf 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,6 +8,33 @@
 #define TDX_CPUID_LEAF_ID	0x21
 #define TDX_IDENT		"IntelTDX    "
 
+/*
+ * SW-defined error codes.
+ *
+ * Bits 47:40 == 0xFF indicate the Reserved status code class that is never
+ * used by the TDX module.
+ */
+#define TDX_ERROR			(1UL << 63)
+#define TDX_SW_ERROR			(TDX_ERROR | GENMASK_ULL(47, 40))
+#define TDX_SEAMCALL_VMFAILINVALID	(TDX_SW_ERROR | 0xFFFF0000ULL)
+
+#ifndef __ASSEMBLY__
+
+/*
+ * Used to gather the output registers values of the TDCALL and SEAMCALL
+ * instructions when requesting services from the TDX module.
+ *
+ * This is a software only structure and not part of the TDX module/VMM ABI.
+ */
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 void __init tdx_early_init(void);
@@ -18,4 +45,5 @@ static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
+#endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 9fb0a2f8b62a..7dca52f5cfc6 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -18,6 +18,7 @@
 #include <asm/bootparam.h>
 #include <asm/suspend.h>
 #include <asm/tlbflush.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_XEN
 #include <xen/interface/xen.h>
@@ -65,6 +66,14 @@ static void __used common(void)
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
+	BLANK();
+	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
+	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
+	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
 	OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/virt/tdxcall.S b/arch/x86/virt/tdxcall.S
new file mode 100644
index 000000000000..b9ec23c95fd5
--- /dev/null
+++ b/arch/x86/virt/tdxcall.S
@@ -0,0 +1,95 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/tdx.h>
+
+/*
+ * TDX_MODULE_CALL - common helper macro for both
+ *                 TDCALL and SEAMCALL instructions.
+ *
+ * TDCALL   - used by TDX guests to make requests to the
+ *            TDX module and hypercalls to the VMM.
+ * SEAMCALL - used by TDX hosts to make requests to the
+ *            TDX module.
+ *
+ * Both instructions are supported in Binutils >= 2.36.
+ */
+#define tdcall		.byte 0x66,0x0f,0x01,0xcc
+#define seamcall	.byte 0x66,0x0f,0x01,0xcf
+
+.macro TDX_MODULE_CALL host:req
+	/*
+	 * R12 will be used as temporary storage for struct tdx_module_output
+	 * pointer. Since R12-R15 registers are not used by TDCALL/SEAMCALL
+	 * services supported by this function, it can be reused.
+	 */
+
+	/* Callee saved, so preserve it */
+	push %r12
+
+	/*
+	 * Push output pointer to stack.
+	 * After the operation, it will be fetched into R12 register.
+	 */
+	push %r9
+
+	/* Mangle function call ABI into TDCALL/SEAMCALL ABI: */
+	/* Move Leaf ID to RAX */
+	mov %rdi, %rax
+	/* Move input 4 to R9 */
+	mov %r8,  %r9
+	/* Move input 3 to R8 */
+	mov %rcx, %r8
+	/* Move input 1 to RCX */
+	mov %rsi, %rcx
+	/* Leave input param 2 in RDX */
+
+	.if \host
+	seamcall
+	/*
+	 * SEAMCALL instruction is essentially a VMExit from VMX root
+	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
+	 * that the targeted SEAM firmware is not loaded or disabled,
+	 * or P-SEAMLDR is busy with another SEAMCALL.  %rax is not
+	 * changed in this case.
+	 *
+	 * Set %rax to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
+	 * This value will never be used as actual SEAMCALL error code as
+	 * it is from the Reserved status code class.
+	 */
+	jnc .Lno_vmfailinvalid
+	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
+.Lno_vmfailinvalid:
+	.else
+	tdcall
+	.endif
+
+	/*
+	 * Fetch output pointer from stack to R12 (It is used
+	 * as temporary storage)
+	 */
+	pop %r12
+
+	/* Check for success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz .Lno_output_struct
+
+	/*
+	 * Since this function can be initiated without an output pointer,
+	 * check if caller provided an output struct before storing
+	 * output registers.
+	 */
+	test %r12, %r12
+	jz .Lno_output_struct
+
+	/* Copy result registers to output struct: */
+	movq %rcx, TDX_MODULE_rcx(%r12)
+	movq %rdx, TDX_MODULE_rdx(%r12)
+	movq %r8,  TDX_MODULE_r8(%r12)
+	movq %r9,  TDX_MODULE_r9(%r12)
+	movq %r10, TDX_MODULE_r10(%r12)
+	movq %r11, TDX_MODULE_r11(%r12)
+
+.Lno_output_struct:
+	/* Restore the state of R12 register */
+	pop %r12
+.endm
-- 
2.34.1



* [PATCHv5 03/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 01/30] x86/tdx: Detect running as a TDX guest in early boot Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 02/30] x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-08 20:03   ` Dave Hansen
  2022-03-10 15:30   ` Borislav Petkov
  2022-03-02 14:27 ` [PATCHv5 04/30] x86/tdx: Extend the confidential computing API to support TDX guests Kirill A. Shutemov
                   ` (26 subsequent siblings)
  29 siblings, 2 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A . Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Guests communicate with VMMs via hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
expose the guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called TDCALL.

In a TDX-based VM, since the VMM is an untrusted entity, an intermediary
layer -- the TDX module -- facilitates secure communication between the
host and the guest. The TDX module is loaded like firmware into a special
CPU mode called SEAM. TDX guests communicate with the TDX module using
the TDCALL instruction.

A guest uses TDCALL to communicate with both the TDX module and VMM.
The value of the RAX register when executing the TDCALL instruction is
used to determine the TDCALL type. A variant of TDCALL used to communicate
with the VMM is called TDVMCALL.

Add generic interfaces to communicate with the TDX module and VMM
(using the TDCALL instruction).

__tdx_hypercall()    - Used by the guest to request services from the
		       VMM (via TDVMCALL).
__tdx_module_call()  - Used to communicate with the TDX module (via
		       TDCALL).

Also define an additional wrapper _tdx_hypercall(), which adds error
handling for TDCALL failures.

The __tdx_module_call() and __tdx_hypercall() helper functions are
implemented in assembly in a .S file.  The TDCALL ABI requires
shuffling arguments in and out of registers, which proved to be
awkward with inline assembly.

Just like syscalls, not all TDVMCALL use cases need to use the same
number of argument registers. The implementation here picks the current
worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
than 4 arguments, there will end up being a few superfluous (cheap)
instructions. But, this approach maximizes code reuse.

For the registers used by the TDCALL instruction, please check the TDX
GHCI specification, the sections titled "TDCALL instruction" and
"TDG.VP.VMCALL Interface".

Based on previous patch by Sean Christopherson.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/Makefile        |   2 +-
 arch/x86/coco/tdcall.S        | 188 ++++++++++++++++++++++++++++++++++
 arch/x86/coco/tdx.c           |  18 ++++
 arch/x86/include/asm/tdx.h    |  27 +++++
 arch/x86/kernel/asm-offsets.c |  10 ++
 5 files changed, 244 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/coco/tdcall.S

diff --git a/arch/x86/coco/Makefile b/arch/x86/coco/Makefile
index 32f4c6e6f199..14af5412e3cd 100644
--- a/arch/x86/coco/Makefile
+++ b/arch/x86/coco/Makefile
@@ -5,4 +5,4 @@ CFLAGS_core.o		+= -fno-stack-protector
 
 obj-y += core.o
 
-obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o tdcall.o
diff --git a/arch/x86/coco/tdcall.S b/arch/x86/coco/tdcall.S
new file mode 100644
index 000000000000..4767e0b5f0d9
--- /dev/null
+++ b/arch/x86/coco/tdcall.S
@@ -0,0 +1,188 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+#include <linux/errno.h>
+
+#include "../virt/tdxcall.S"
+
+/*
+ * Bitmasks of exposed registers (with VMM).
+ */
+#define TDX_R10		BIT(10)
+#define TDX_R11		BIT(11)
+#define TDX_R12		BIT(12)
+#define TDX_R13		BIT(13)
+#define TDX_R14		BIT(14)
+#define TDX_R15		BIT(15)
+
+/*
+ * These registers are clobbered to hold arguments for each
+ * TDVMCALL. They are safe to expose to the VMM.
+ * Each bit in this mask represents a register ID. Bit field
+ * details can be found in TDX GHCI specification, section
+ * titled "TDCALL [TDG.VP.VMCALL] leaf".
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK	( TDX_R10 | TDX_R11 | \
+					  TDX_R12 | TDX_R13 | \
+					  TDX_R14 | TDX_R15 )
+
+/*
+ * __tdx_module_call()  - Used by TDX guests to request services from
+ * the TDX module (does not include VMM services).
+ *
+ * Transforms function call register arguments into the TDCALL
+ * register ABI.  After TDCALL operation, TDX module output is saved
+ * in @out (if it is provided by the user)
+ *
+ *-------------------------------------------------------------------------
+ * TDCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX                 - TDCALL Leaf number.
+ * RCX,RDX,R8-R9       - TDCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX                 - TDCALL instruction error code.
+ * RCX,RDX,R8-R11      - TDCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_module_call() function ABI:
+ *
+ * @fn  (RDI)          - TDCALL Leaf ID,    moved to RAX
+ * @rcx (RSI)          - Input parameter 1, moved to RCX
+ * @rdx (RDX)          - Input parameter 2, moved to RDX
+ * @r8  (RCX)          - Input parameter 3, moved to R8
+ * @r9  (R8)           - Input parameter 4, moved to R9
+ *
+ * @out (R9)           - struct tdx_module_output pointer
+ *                       stored temporarily in R12 (not
+ *                       shared with the TDX module). It
+ *                       can be NULL.
+ *
+ * Return status of TDCALL via RAX.
+ */
+SYM_FUNC_START(__tdx_module_call)
+	FRAME_BEGIN
+	TDX_MODULE_CALL host=0
+	FRAME_END
+	ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * __tdx_hypercall() - Make hypercalls to a TDX VMM.
+ *
+ * Transforms values in function call argument struct tdx_hypercall_args @args
+ * into the TDCALL register ABI. After TDCALL operation, VMM output is saved
+ * back in @args.
+ *
+ *-------------------------------------------------------------------------
+ * TD VMCALL ABI:
+ *-------------------------------------------------------------------------
+ *
+ * Input Registers:
+ *
+ * RAX                 - TDCALL instruction leaf number (0 - TDG.VP.VMCALL)
+ * RCX                 - BITMAP which controls which part of TD Guest GPR
+ *                       is passed as-is to the VMM and back.
+ * R10                 - Set 0 to indicate TDCALL follows standard TDX ABI
+ *                       specification. Non zero value indicates vendor
+ *                       specific ABI.
+ * R11                 - VMCALL sub function number
+ * RBX, RBP, RDI, RSI  - Used to pass VMCALL sub function specific arguments.
+ * R8-R9, R12-R15      - Same as above.
+ *
+ * Output Registers:
+ *
+ * RAX                 - TDCALL instruction status (Not related to hypercall
+ *                        output).
+ * R10                 - Hypercall output error code.
+ * R11-R15             - Hypercall sub function specific output values.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_hypercall() function ABI:
+ *
+ * @args  (RDI)        - struct tdx_hypercall_args for input and output
+ * @flags (RSI)        - TDX_HCALL_* flags
+ *
+ * On successful completion, return the hypercall error code.
+ */
+SYM_FUNC_START(__tdx_hypercall)
+	FRAME_BEGIN
+
+	/* Save callee-saved GPRs as mandated by the x86_64 ABI */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	/* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
+	xor %eax, %eax
+
+	/* Copy hypercall registers from arg struct: */
+	movq TDX_HYPERCALL_r10(%rdi), %r10
+	movq TDX_HYPERCALL_r11(%rdi), %r11
+	movq TDX_HYPERCALL_r12(%rdi), %r12
+	movq TDX_HYPERCALL_r13(%rdi), %r13
+	movq TDX_HYPERCALL_r14(%rdi), %r14
+	movq TDX_HYPERCALL_r15(%rdi), %r15
+
+	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+	tdcall
+
+	/*
+	 * RAX!=0 indicates a failure of the TDVMCALL mechanism itself and that
+	 * something has gone horribly wrong with the TDX module.
+	 *
+	 * The return status of the hypercall operation is in a separate
+	 * register (in R10). Hypercall errors are a part of normal operation
+	 * and are handled by callers.
+	 */
+	testq %rax, %rax
+	jne .Lpanic
+
+	/* TDVMCALL leaf return code is in R10 */
+	movq %r10, %rax
+
+	/* Copy hypercall result registers to arg struct if needed */
+	testq $TDX_HCALL_HAS_OUTPUT, %rsi
+	jz .Lout
+
+	movq %r10, TDX_HYPERCALL_r10(%rdi)
+	movq %r11, TDX_HYPERCALL_r11(%rdi)
+	movq %r12, TDX_HYPERCALL_r12(%rdi)
+	movq %r13, TDX_HYPERCALL_r13(%rdi)
+	movq %r14, TDX_HYPERCALL_r14(%rdi)
+	movq %r15, TDX_HYPERCALL_r15(%rdi)
+.Lout:
+	/*
+	 * Zero out registers exposed to the VMM to avoid speculative execution
+	 * with VMM-controlled values. This needs to include all registers
+	 * present in TDVMCALL_EXPOSE_REGS_MASK (except R12-R15). R12-R15
+	 * context will be restored.
+	 */
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+
+	/* Restore callee-saved GPRs as mandated by the x86_64 ABI */
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	FRAME_END
+
+	retq
+.Lpanic:
+	ud2
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 00898e3eb77f..17365fd40ba2 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -7,6 +7,24 @@
 #include <linux/cpufeature.h>
 #include <asm/tdx.h>
 
+/*
+ * Wrapper for standard use of __tdx_hypercall with no output aside from
+ * return code.
+ */
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = fn,
+		.r12 = r12,
+		.r13 = r13,
+		.r14 = r14,
+		.r15 = r15,
+	};
+
+	return __tdx_hypercall(&args, 0);
+}
+
 void __init tdx_early_init(void)
 {
 	u32 eax, sig[3];
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e5ff8ed59adf..003c4d101297 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -3,11 +3,16 @@
 #ifndef _ASM_X86_TDX_H
 #define _ASM_X86_TDX_H
 
+#include <linux/bits.h>
 #include <linux/init.h>
 
 #define TDX_CPUID_LEAF_ID	0x21
 #define TDX_IDENT		"IntelTDX    "
 
+#define TDX_HYPERCALL_STANDARD  0
+
+#define TDX_HCALL_HAS_OUTPUT	BIT(0)
+
 /*
  * SW-defined error codes.
  *
@@ -35,10 +40,32 @@ struct tdx_module_output {
 	u64 r11;
 };
 
+/*
+ * Used in __tdx_hypercall() to pass down and get back registers' values of
+ * the TDCALL instruction when requesting services from the VMM.
+ *
+ * This is a software only structure and not part of the TDX module/VMM ABI.
+ */
+struct tdx_hypercall_args {
+	u64 r10;
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 void __init tdx_early_init(void);
 
+/* Used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		      struct tdx_module_output *out);
+
+/* Used to request services from the VMM */
+u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
+
 #else
 
 static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 7dca52f5cfc6..0b465e7d0a2f 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -74,6 +74,16 @@ static void __used common(void)
 	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
 	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+	BLANK();
+	OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_args, r10);
+	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_args, r11);
+	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_args, r12);
+	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_args, r13);
+	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_args, r14);
+	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_args, r15);
+#endif
+
 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
 	OFFSET(BP_secure_boot, boot_params, secure_boot);
-- 
2.34.1



* [PATCHv5 04/30] x86/tdx: Extend the confidential computing API to support TDX guests
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 03/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-08 20:17   ` Dave Hansen
  2022-03-02 14:27 ` [PATCHv5 05/30] x86/tdx: Exclude shared bit from __PHYSICAL_MASK Kirill A. Shutemov
                   ` (25 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov

Confidential Computing (CC) features (like string I/O unroll support,
memory encryption/decryption support, etc) are conditionally enabled
in the kernel using the cc_platform_has() API. Since TDX guests also
need to use these CC features, extend the cc_platform_has() API and add
support for TDX guest-specific CC attributes.

Like AMD SME/SEV, TDX uses a bit in the page table entry to indicate
the encryption status of the page, but the polarity of the mask is
opposite to AMD's: if the bit is set, the page is accessible to the VMM.

Which bit in the page table entry indicates the shared/private state is
determined using the TDINFO TDCALL.
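
A worked example of the polarity difference, assuming a gpa_width of 52
(so cc_mask is bit 51; the real value is read at runtime via TDINFO):

	/*
	 * cc_mask = BIT_ULL(51)
	 *
	 * AMD SME/SEV (the C-bit marks memory as private):
	 *   cc_mkenc(val) == val |  cc_mask    set bit   -> private
	 *   cc_mkdec(val) == val & ~cc_mask    clear bit -> shared
	 *
	 * Intel TDX (the shared bit marks memory as shared):
	 *   cc_mkenc(val) == val & ~cc_mask    clear bit -> private
	 *   cc_mkdec(val) == val |  cc_mask    set bit   -> shared
	 */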

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig     |  1 +
 arch/x86/coco/core.c |  4 ++++
 arch/x86/coco/tdx.c  | 38 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 43 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c346d66b51fc..93e67842e369 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
 	bool "Intel TDX (Trust Domain Extensions) - Guest Support"
 	depends on X86_64 && CPU_SUP_INTEL
 	depends on X86_X2APIC
+	select ARCH_HAS_CC_PLATFORM
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index fc1365dd927e..9113baebbfd2 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -90,6 +90,8 @@ u64 cc_mkenc(u64 val)
 	switch (vendor) {
 	case CC_VENDOR_AMD:
 		return val | cc_mask;
+	case CC_VENDOR_INTEL:
+		return val & ~cc_mask;
 	default:
 		return val;
 	}
@@ -100,6 +102,8 @@ u64 cc_mkdec(u64 val)
 	switch (vendor) {
 	case CC_VENDOR_AMD:
 		return val & ~cc_mask;
+	case CC_VENDOR_INTEL:
+		return val | cc_mask;
 	default:
 		return val;
 	}
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 17365fd40ba2..912ef12e434e 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -5,8 +5,12 @@
 #define pr_fmt(fmt)     "tdx: " fmt
 
 #include <linux/cpufeature.h>
+#include <asm/coco.h>
 #include <asm/tdx.h>
 
+/* TDX module Call Leaf IDs */
+#define TDX_GET_INFO			1
+
 /*
  * Wrapper for standard use of __tdx_hypercall with no output aside from
  * return code.
@@ -25,8 +29,32 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
 	return __tdx_hypercall(&args, 0);
 }
 
+static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+				   struct tdx_module_output *out)
+{
+	if (__tdx_module_call(fn, rcx, rdx, r8, r9, out))
+		panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
+}
+
+static void get_info(unsigned int *gpa_width)
+{
+	struct tdx_module_output out;
+
+	/*
+	 * TDINFO TDX module call is used to get the TD execution environment
+	 * information like GPA width, number of available vcpus, debug mode
+	 * information, etc. More details about the ABI can be found in TDX
+	 * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
+	 * [TDG.VP.INFO].
+	 */
+	tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
+
+	*gpa_width = out.rcx & GENMASK(5, 0);
+}
+
 void __init tdx_early_init(void)
 {
+	unsigned int gpa_width;
 	u32 eax, sig[3];
 
 	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
@@ -37,5 +65,15 @@ void __init tdx_early_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+	get_info(&gpa_width);
+
+	cc_set_vendor(CC_VENDOR_INTEL);
+
+	/*
+	 * The highest bit of a guest physical address is the "sharing" bit.
+	 * Set it for shared pages and clear it for private pages.
+	 */
+	cc_set_mask(BIT_ULL(gpa_width - 1));
+
 	pr_info("Guest detected\n");
 }
-- 
2.34.1



* [PATCHv5 05/30] x86/tdx: Exclude shared bit from __PHYSICAL_MASK
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 04/30] x86/tdx: Extend the confidential computing API to support TDX guests Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 06/30] x86/traps: Refactor exc_general_protection() Kirill A. Shutemov
                   ` (24 subsequent siblings)
  29 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov

In TDX guests, by default memory is protected from host access. If a
guest needs to communicate with the VMM (like the I/O use case), it uses
a single bit in the physical address to communicate the protected/shared
attribute of the given page.

In the x86 ARCH code, the __PHYSICAL_MASK macro represents the width of
the physical address in the given architecture. It is used in creating
the physical PAGE_MASK for address bits in the kernel. Since in a TDX
guest a single bit is used as metadata, it needs to be excluded from the
valid physical address bits to avoid using incorrect address bits in the
kernel.

Enable DYNAMIC_PHYSICAL_MASK to support updating the __PHYSICAL_MASK.
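
A worked example, assuming a hypothetical gpa_width of 48 (the real value
is retrieved via TDINFO):

	/*
	 * gpa_width == 48: bit 47 is the shared bit, so only bits 46:0
	 * are valid physical address bits:
	 *
	 *   physical_mask &= GENMASK_ULL(gpa_width - 2, 0);
	 *   physical_mask &= GENMASK_ULL(46, 0);
	 */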

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/Kconfig    | 1 +
 arch/x86/coco/tdx.c | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 93e67842e369..d2f45e58e846 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -885,6 +885,7 @@ config INTEL_TDX_GUEST
 	depends on X86_64 && CPU_SUP_INTEL
 	depends on X86_X2APIC
 	select ARCH_HAS_CC_PLATFORM
+	select DYNAMIC_PHYSICAL_MASK
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 912ef12e434e..34818dc31248 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -69,6 +69,14 @@ void __init tdx_early_init(void)
 
 	cc_set_vendor(CC_VENDOR_INTEL);
 
+	/*
+	 * All bits above GPA width are reserved and kernel treats shared bit
+	 * as flag, not as part of physical address.
+	 *
+	 * Adjust physical mask to only cover valid GPA bits.
+	 */
+	physical_mask &= GENMASK_ULL(gpa_width - 2, 0);
+
 	/*
 	 * The highest bit of a guest physical address is the "sharing" bit.
 	 * Set it for shared pages and clear it for private pages.
-- 
2.34.1



* [PATCHv5 06/30] x86/traps: Refactor exc_general_protection()
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 05/30] x86/tdx: Exclude shared bit from __PHYSICAL_MASK Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-08 20:18   ` Dave Hansen
  2022-03-02 14:27 ` [PATCHv5 07/30] x86/traps: Add #VE support for TDX guest Kirill A. Shutemov
                   ` (23 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov

TDX brings a new exception -- Virtualization Exception (#VE). Handling
of #VE is structurally very similar to handling of #GP.

Extract two helpers from exc_general_protection() that can be reused for
handling #VE.

No functional changes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/traps.c | 57 ++++++++++++++++++++++++-----------------
 1 file changed, 34 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 7ef00dee35be..733b6490523c 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -611,13 +611,43 @@ static bool try_fixup_enqcmd_gp(void)
 #endif
 }
 
+static bool gp_try_fixup_and_notify(struct pt_regs *regs, int trapnr,
+				    unsigned long error_code, const char *str)
+{
+	int ret;
+
+	if (fixup_exception(regs, trapnr, error_code, 0))
+		return true;
+
+	current->thread.error_code = error_code;
+	current->thread.trap_nr = trapnr;
+
+	/*
+	 * To be potentially processing a kprobe fault and to trust the result
+	 * from kprobe_running(), we have to be non-preemptible.
+	 */
+	if (!preemptible() && kprobe_running() &&
+	    kprobe_fault_handler(regs, trapnr))
+		return true;
+
+	ret = notify_die(DIE_GPF, str, regs, error_code, trapnr, SIGSEGV);
+	return ret == NOTIFY_STOP;
+}
+
+static void gp_user_force_sig_segv(struct pt_regs *regs, int trapnr,
+				   unsigned long error_code, const char *str)
+{
+	current->thread.error_code = error_code;
+	current->thread.trap_nr = trapnr;
+	show_signal(current, SIGSEGV, "", str, regs, error_code);
+	force_sig(SIGSEGV);
+}
+
 DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 {
 	char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
 	enum kernel_gp_hint hint = GP_NO_HINT;
-	struct task_struct *tsk;
 	unsigned long gp_addr;
-	int ret;
 
 	if (user_mode(regs) && try_fixup_enqcmd_gp())
 		return;
@@ -636,40 +666,21 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 		return;
 	}
 
-	tsk = current;
-
 	if (user_mode(regs)) {
 		if (fixup_iopl_exception(regs))
 			goto exit;
 
-		tsk->thread.error_code = error_code;
-		tsk->thread.trap_nr = X86_TRAP_GP;
-
 		if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
 			goto exit;
 
-		show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
-		force_sig(SIGSEGV);
+		gp_user_force_sig_segv(regs, X86_TRAP_GP, error_code, desc);
 		goto exit;
 	}
 
 	if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
 		goto exit;
 
-	tsk->thread.error_code = error_code;
-	tsk->thread.trap_nr = X86_TRAP_GP;
-
-	/*
-	 * To be potentially processing a kprobe fault and to trust the result
-	 * from kprobe_running(), we have to be non-preemptible.
-	 */
-	if (!preemptible() &&
-	    kprobe_running() &&
-	    kprobe_fault_handler(regs, X86_TRAP_GP))
-		goto exit;
-
-	ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);
-	if (ret == NOTIFY_STOP)
+	if (gp_try_fixup_and_notify(regs, X86_TRAP_GP, error_code, desc))
 		goto exit;
 
 	if (error_code)
-- 
2.34.1



* [PATCHv5 07/30] x86/traps: Add #VE support for TDX guest
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 06/30] x86/traps: Refactor exc_general_protection() Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-08 20:29   ` Dave Hansen
  2022-03-02 14:27 ` [PATCHv5 08/30] x86/tdx: Add HLT support for TDX guests Kirill A. Shutemov
                   ` (22 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov,
	Sean Christopherson

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:

 * Specific instructions (WBINVD, for example)
 * Specific MSR accesses
 * Specific CPUID leaf accesses
 * Access to specific guest physical addresses

Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.

For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.

Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.

During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.

TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag.  This must be done before anything else as
any #VE that occurs while the valid flag is set is escalated to a #DF by
the TDX module, which results in an oops.

Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be
delivered until TDGETVEINFO is called.

For now, convert unhandled #VEs (everything, until later in this
series) so that they appear just like a #GP, by calling
ve_raise_fault() directly. ve_raise_fault() is similar to the #GP
handler: it is responsible for sending a SIGSEGV to userspace, dying
via die_addr() for kernel faults, and notifying debuggers and other
die-chain users.
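
For a sense of where this is headed, later patches in this series fill in
tdx_handle_virt_exception() roughly along these lines (a hedged sketch;
handler names such as handle_halt() and read_msr() are illustrative, and
the real code also distinguishes user-mode from kernel-mode #VEs):

	/* Illustrative kernel-mode dispatch, filled in by later patches: */
	static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
	{
		switch (ve->exit_reason) {
		case EXIT_REASON_HLT:
			return handle_halt();
		case EXIT_REASON_MSR_READ:
			return read_msr(regs);
		/* ... MSR write, CPUID, MMIO, port I/O ... */
		default:
			pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
			return false;
		}
	}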

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/tdx.c             | 31 +++++++++++++
 arch/x86/include/asm/idtentry.h |  4 ++
 arch/x86/include/asm/tdx.h      | 21 +++++++++
 arch/x86/kernel/idt.c           |  3 ++
 arch/x86/kernel/traps.c         | 81 +++++++++++++++++++++++++++++++++
 5 files changed, 140 insertions(+)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 34818dc31248..6b2b738a2ba2 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -10,6 +10,7 @@
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_INFO			1
+#define TDX_GET_VEINFO			3
 
 /*
  * Wrapper for standard use of __tdx_hypercall with no output aside from
@@ -52,6 +53,36 @@ static void get_info(unsigned int *gpa_width)
 	*gpa_width = out.rcx & GENMASK(5, 0);
 }
 
+void tdx_get_ve_info(struct ve_info *ve)
+{
+	struct tdx_module_output out;
+
+	/*
+	 * Called during #VE handling to retrieve the #VE info from the
+	 * TDX module.
+	 *
+	 * This should be called early in #VE handling.  A "nested"
+	 * #VE which occurs before this will raise a #DF and is not
+	 * recoverable.
+	 */
+	tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);
+
+	/* Interrupts and NMIs can be delivered again. */
+	ve->exit_reason = out.rcx;
+	ve->exit_qual   = out.rdx;
+	ve->gla         = out.r8;
+	ve->gpa         = out.r9;
+	ve->instr_len   = lower_32_bits(out.r10);
+	ve->instr_info  = upper_32_bits(out.r10);
+}
+
+bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
+{
+	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+
+	return false;
+}
+
 void __init tdx_early_init(void)
 {
 	unsigned int gpa_width;
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..8ccc81d653b3 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -625,6 +625,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER,	exc_xen_hypervisor_callback);
 DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER,	exc_xen_unknown_trap);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE,		exc_virtualization_exception);
+#endif
+
 /* Device interrupts common/spurious */
 DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 003c4d101297..8af81ea2779d 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -5,6 +5,7 @@
 
 #include <linux/bits.h>
 #include <linux/init.h>
+#include <asm/ptrace.h>
 
 #define TDX_CPUID_LEAF_ID	0x21
 #define TDX_IDENT		"IntelTDX    "
@@ -55,6 +56,22 @@ struct tdx_hypercall_args {
 	u64 r15;
 };
 
+/*
+ * Used by the #VE exception handler to gather the #VE exception
+ * info from the TDX module. This is a software only structure
+ * and not part of the TDX module/VMM ABI.
+ */
+struct ve_info {
+	u64 exit_reason;
+	u64 exit_qual;
+	/* Guest Linear (virtual) Address */
+	u64 gla;
+	/* Guest Physical Address */
+	u64 gpa;
+	u32 instr_len;
+	u32 instr_info;
+};
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 void __init tdx_early_init(void);
@@ -66,6 +83,10 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 /* Used to request services from the VMM */
 u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
 
+void tdx_get_ve_info(struct ve_info *ve);
+
+bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
+
 #else
 
 static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..1da074123c16 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
 	 */
 	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 #endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 733b6490523c..1c3cb952fa2a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -62,6 +62,7 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/vdso.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
@@ -1278,6 +1279,86 @@ DEFINE_IDTENTRY(exc_device_not_available)
 	}
 }
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+#define VE_FAULT_STR "VE fault"
+
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+	if (user_mode(regs)) {
+		gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
+		return;
+	}
+
+	if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR))
+		return;
+
+	die_addr(VE_FAULT_STR, regs, error_code, 0);
+}
+
+/*
+ * Virtualization Exceptions (#VE) are delivered to TDX guests due to
+ * specific guest actions which may happen in either user space or the
+ * kernel:
+ *
+ *  * Specific instructions (WBINVD, for example)
+ *  * Specific MSR accesses
+ *  * Specific CPUID leaf accesses
+ *  * Access to specific guest physical addresses
+ *
+ * In the settings that Linux will run in, virtualization exceptions are
+ * never generated on accesses to normal, TD-private memory that has been
+ * accepted.
+ *
+ * Syscall entry code has a critical window where the kernel stack is not
+ * yet set up. Any exception in this window leads to hard to debug issues
+ * and can be exploited for privilege escalation. Exceptions in the NMI
+ * entry code also cause issues. Returning from the exception handler with
+ * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
+ *
+ * For these reasons, the kernel avoids #VEs during the syscall gap and
+ * the NMI entry code. Entry code paths do not access TD-shared memory,
+ * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
+ * that might generate #VE. VMM can remove memory from TD at any point,
+ * but access to unaccepted (or missing) private memory leads to VM
+ * termination, not to #VE.
+ *
+ * Similarly to page faults and breakpoints, #VEs are allowed in NMI
+ * handlers once the kernel is ready to deal with nested NMIs.
+ *
+ * During #VE delivery, all interrupts, including NMIs, are blocked until
+ * TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
+ * the VE info.
+ *
+ * If a guest kernel action which would normally cause a #VE occurs in
+ * the interrupt-disabled region before TDGETVEINFO, a #DF (fault
+ * exception) is delivered to the guest which will result in an oops.
+ */
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+	struct ve_info ve;
+
+	/*
+	 * NMIs/Machine-checks/Interrupts will be in a disabled state
+	 * till TDGETVEINFO TDCALL is executed. This ensures that VE
+	 * info cannot be overwritten by a nested #VE.
+	 */
+	tdx_get_ve_info(&ve);
+
+	cond_local_irq_enable(regs);
+
+	/*
+	 * If tdx_handle_virt_exception() could not process
+	 * it successfully, treat it as #GP(0) and handle it.
+	 */
+	if (!tdx_handle_virt_exception(regs, &ve))
+		ve_raise_fault(regs, 0);
+
+	cond_local_irq_disable(regs);
+}
+
+#endif
+
 #ifdef CONFIG_X86_32
 DEFINE_IDTENTRY_SW(iret_error)
 {
-- 
2.34.1



* [PATCHv5 08/30] x86/tdx: Add HLT support for TDX guests
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 07/30] x86/traps: Add #VE support for TDX guest Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 09/30] x86/tdx: Add MSR " Kirill A. Shutemov
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov,
	Dave Hansen

The HLT instruction is a privileged instruction; executing it stops
instruction execution and places the processor in a HALT state. The
kernel uses it for cases like reboot, the idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
just before the HLT instruction (this is also called safe_halt()).
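
For reference, the native safe_halt() is essentially an STI immediately
followed by HLT:

	static inline void native_safe_halt(void)
	{
		asm volatile("sti; hlt" : : : "memory");
	}

STI enables interrupts only after the instruction that follows it
completes, so an interrupt cannot be taken between the STI and the HLT;
it instead wakes the CPU out of the HALT state.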

To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].

In TDX guests, executing the HLT instruction generates a #VE, which is
used to emulate the HLT instruction. But #VE-based emulation will not
work for the safe_halt() flavor, because it requires the STI instruction
to be executed just before the TDCALL. Since the idle loop is the only
user of the safe_halt() variant, handle it as a special case.

To avoid the *safe_halt() call in the idle function, define
tdx_safe_halt() and use it to override the "x86_idle" function pointer
for a valid TDX guest.

Alternative choices like PV ops have been considered for adding
safe_halt() support, but were rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling PARAVIRT_XXL in a TDX
guest just for the safe_halt() use case is not worth the cost.
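
For context, a PV-ops-based version would have looked roughly like this
(a hypothetical sketch; pv_ops.irq.safe_halt is the existing
PARAVIRT_XXL hook, the tdx_pv_safe_halt() name is made up for
illustration):

	static void tdx_pv_safe_halt(void)
	{
		/* Same semantics as tdx_safe_halt() below */
		__halt(false /* irq_disabled */, true /* do_sti */);
	}

	/* Would drag in all of PARAVIRT_XXL just for this one hook: */
	pv_ops.irq.safe_halt = tdx_pv_safe_halt;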

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/coco/tdcall.S     | 13 ++++++
 arch/x86/coco/tdx.c        | 93 +++++++++++++++++++++++++++++++++++++-
 arch/x86/include/asm/tdx.h |  4 ++
 arch/x86/kernel/process.c  |  4 ++
 4 files changed, 112 insertions(+), 2 deletions(-)

diff --git a/arch/x86/coco/tdcall.S b/arch/x86/coco/tdcall.S
index 4767e0b5f0d9..29e81104d312 100644
--- a/arch/x86/coco/tdcall.S
+++ b/arch/x86/coco/tdcall.S
@@ -138,6 +138,19 @@ SYM_FUNC_START(__tdx_hypercall)
 
 	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
 
+	/*
+	 * For the idle loop STI needs to be called directly before the TDCALL
+	 * that enters idle (EXIT_REASON_HLT case). STI instruction enables
+	 * interrupts only one instruction later. If there is a window between
+	 * STI and the instruction that emulates the HALT state, there is a
+	 * chance for interrupts to happen in this window, which can delay the
+	 * HLT operation indefinitely. Since this is not the desired
+	 * result, conditionally call STI before TDCALL.
+	 */
+	testq $TDX_HCALL_ISSUE_STI, %rsi
+	jz .Lskip_sti
+	sti
+.Lskip_sti:
 	tdcall
 
 	/*
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 6b2b738a2ba2..0c8214e1cdb5 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -7,6 +7,7 @@
 #include <linux/cpufeature.h>
 #include <asm/coco.h>
 #include <asm/tdx.h>
+#include <asm/vmx.h>
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_INFO			1
@@ -30,6 +31,17 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
 	return __tdx_hypercall(&args, 0);
 }
 
+/*
+ * The TDG.VP.VMCALL-Instruction-execution sub-functions are defined
+ * independently from but are currently matched 1:1 with VMX EXIT_REASONs.
+ * Reusing the KVM EXIT_REASON macros makes it easier to connect the host and
+ * guest sides of these calls.
+ */
+static u64 hcall_func(u64 exit_reason)
+{
+	return exit_reason;
+}
+
 static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 				   struct tdx_module_output *out)
 {
@@ -53,6 +65,62 @@ static void get_info(unsigned int *gpa_width)
 	*gpa_width = out.rcx & GENMASK(5, 0);
 }
 
+static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = hcall_func(EXIT_REASON_HLT),
+		.r12 = irq_disabled,
+	};
+
+	/*
+	 * Emulate HLT operation via hypercall. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface
+	 * (GHCI), section 3.8 TDG.VP.VMCALL<Instruction.HLT>.
+	 *
+	 * The VMM uses the "IRQ disabled" param to understand IRQ
+	 * enabled status (RFLAGS.IF) of the TD guest and to determine
+	 * whether or not it should schedule the halted vCPU if an
+	 * IRQ becomes pending. E.g. if IRQs are disabled, the VMM
+	 * can keep the vCPU in virtual HLT, even if an IRQ is
+	 * pending, without hanging/breaking the guest.
+	 */
+	return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0);
+}
+
+static bool handle_halt(void)
+{
+	/*
+	 * Since non-safe halt is mainly used in CPU offlining
+	 * and the guest will always stay in the halt state, don't
+	 * call the STI instruction (set do_sti as false).
+	 */
+	const bool irq_disabled = irqs_disabled();
+	const bool do_sti = false;
+
+	if (__halt(irq_disabled, do_sti))
+		return false;
+
+	return true;
+}
+
+void __cpuidle tdx_safe_halt(void)
+{
+	 /*
+	  * For do_sti=true case, __tdx_hypercall() function enables
+	  * interrupts using the STI instruction before the TDCALL. So
+	  * set irq_disabled as false.
+	  */
+	const bool irq_disabled = false;
+	const bool do_sti = true;
+
+	/*
+	 * Use WARN_ONCE() to report the failure.
+	 */
+	if (__halt(irq_disabled, do_sti))
+		WARN_ONCE(1, "HLT instruction emulation failed\n");
+}
+
 void tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -76,11 +144,32 @@ void tdx_get_ve_info(struct ve_info *ve)
 	ve->instr_info  = upper_32_bits(out.r10);
 }
 
+/* Handle the kernel #VE */
+static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
+{
+	switch (ve->exit_reason) {
+	case EXIT_REASON_HLT:
+		return handle_halt();
+	default:
+		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+		return false;
+	}
+}
+
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
 {
-	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+	bool ret;
+
+	if (user_mode(regs))
+		ret = false;
+	else
+		ret = virt_exception_kernel(regs, ve);
+
+	/* After successful #VE handling, move the IP */
+	if (ret)
+		regs->ip += ve->instr_len;
 
-	return false;
+	return ret;
 }
 
 void __init tdx_early_init(void)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8af81ea2779d..1f150e7a2f8f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -13,6 +13,7 @@
 #define TDX_HYPERCALL_STANDARD  0
 
 #define TDX_HCALL_HAS_OUTPUT	BIT(0)
+#define TDX_HCALL_ISSUE_STI	BIT(1)
 
 /*
  * SW-defined error codes.
@@ -87,9 +88,12 @@ void tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
 
+void tdx_safe_halt(void);
+
 #else
 
 static inline void tdx_early_init(void) { };
+static inline void tdx_safe_halt(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e131d71b3cae..2e90d57cf86e 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -46,6 +46,7 @@
 #include <asm/proto.h>
 #include <asm/frame.h>
 #include <asm/unwind.h>
+#include <asm/tdx.h>
 
 #include "process.h"
 
@@ -873,6 +874,9 @@ void select_idle_routine(const struct cpuinfo_x86 *c)
 	} else if (prefer_mwait_c1_over_halt(c)) {
 		pr_info("using mwait in idle threads\n");
 		x86_idle = mwait_idle;
+	} else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+		pr_info("using TDX aware idle routine\n");
+		x86_idle = tdx_safe_halt;
 	} else
 		x86_idle = default_idle;
 }
-- 
2.34.1



* [PATCHv5 09/30] x86/tdx: Add MSR support for TDX guests
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 08/30] x86/tdx: Add HLT support for TDX guests Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 10/30] x86/tdx: Handle CPUID via #VE Kirill A. Shutemov
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov,
	Dave Hansen

Use hypercall to emulate MSR read/write for the TDX platform.

There are two viable approaches for doing MSRs in a TD guest:

1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal
   do. Some will succeed, others will cause a #VE. All of those that
   cause a #VE will be handled with a TDCALL.
2. Use paravirt infrastructure.  The paravirt hook has to keep a list
   of which MSRs would cause a #VE and use a TDCALL.  All other MSRs
   execute RDMSR/WRMSR instructions directly.

The second option can be ruled out because the list of MSRs would be
challenging to maintain. That leaves option #1 as the only viable
solution for the minimal TDX support.

For performance-critical MSR writes (like TSC_DEADLINE), future patches
will replace the WRMSR/#VE sequence with the direct TDCALL.

RDMSR and WRMSR specification details can be found in the
Guest-Host-Communication Interface (GHCI) for Intel Trust Domain
Extensions (Intel TDX) specification, sections titled
"TDG.VP.VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>".
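
A minimal sketch of the resulting flow, assuming the read_msr() handler
added below (the MSR index is illustrative):

	/* Guest code, unchanged: */
	u32 low, high;
	rdmsr(MSR_IA32_APICBASE, low, high);
	/*
	 * If the TDX module does not virtualize this MSR, the RDMSR
	 * raises #VE with exit_reason == EXIT_REASON_MSR_READ. The #VE
	 * handler calls read_msr(), which re-issues the access as
	 * TDVMCALL<Instruction.RDMSR>, stores the result in
	 * regs->ax/regs->dx, and lets the handler advance RIP past the
	 * faulting RDMSR.
	 */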

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/coco/tdx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 0c8214e1cdb5..f3c6767a42d2 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -121,6 +121,44 @@ void __cpuidle tdx_safe_halt(void)
 		WARN_ONCE(1, "HLT instruction emulation failed\n");
 }
 
+static bool read_msr(struct pt_regs *regs)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = hcall_func(EXIT_REASON_MSR_READ),
+		.r12 = regs->cx,
+	};
+
+	/*
+	 * Emulate the MSR read via hypercall. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface
+	 * (GHCI), section titled "TDG.VP.VMCALL<Instruction.RDMSR>".
+	 */
+	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+		return false;
+
+	regs->ax = lower_32_bits(args.r11);
+	regs->dx = upper_32_bits(args.r11);
+	return true;
+}
+
+static bool write_msr(struct pt_regs *regs)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = hcall_func(EXIT_REASON_MSR_WRITE),
+		.r12 = regs->cx,
+		.r13 = (u64)regs->dx << 32 | regs->ax,
+	};
+
+	/*
+	 * Emulate the MSR write via hypercall. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface
+	 * (GHCI) section titled "TDG.VP.VMCALL<Instruction.WRMSR>".
+	 */
+	return !__tdx_hypercall(&args, 0);
+}
+
 void tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -150,6 +188,10 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 	switch (ve->exit_reason) {
 	case EXIT_REASON_HLT:
 		return handle_halt();
+	case EXIT_REASON_MSR_READ:
+		return read_msr(regs);
+	case EXIT_REASON_MSR_WRITE:
+		return write_msr(regs);
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return false;
-- 
2.34.1



* [PATCHv5 10/30] x86/tdx: Handle CPUID via #VE
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 09/30] x86/tdx: Add MSR " Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-08 20:33   ` Dave Hansen
  2022-03-02 14:27 ` [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO Kirill A. Shutemov
                   ` (19 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov

In TDX guests, most CPUID leaf/sub-leaf combinations are virtualized
by the TDX module while some trigger #VE.

Implement the #VE handling for EXIT_REASON_CPUID by handing it through
the hypercall, which in turn lets the TDX module handle it by invoking
the host VMM.

More details on CPUID Virtualization can be found in the TDX module
specification, the section titled "CPUID Virtualization".

Note that the VMM that handles the hypercall is not trusted. It can
return data that may steer the guest kernel in the wrong direction. Only
allow the VMM to control the range reserved for hypervisor
communication. Return all-zeros for any CPUID outside the range.
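
For CPUID leaves that do reach the #VE handler, the filter behaves like
this (sketch; the out-of-range leaf value is illustrative):

	u32 a, b, c, d;

	/* Inside 0x40000000-0x4FFFFFFF: forwarded to the VMM */
	cpuid_count(0x40000000, 0, &a, &b, &c, &d);

	/* Outside the range: a = b = c = d = 0, no hypercall issued */
	cpuid_count(0x8000001F, 0, &a, &b, &c, &d);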

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/tdx.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index f3c6767a42d2..d00b367f8052 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -159,6 +159,47 @@ static bool write_msr(struct pt_regs *regs)
 	return !__tdx_hypercall(&args, 0);
 }
 
+static bool handle_cpuid(struct pt_regs *regs)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = hcall_func(EXIT_REASON_CPUID),
+		.r12 = regs->ax,
+		.r13 = regs->cx,
+	};
+
+	/*
+	 * Only allow VMM to control range reserved for hypervisor
+	 * communication.
+	 *
+	 * Return all-zeros for any CPUID outside the range.
+	 */
+	if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
+		regs->ax = regs->bx = regs->cx = regs->dx = 0;
+		return true;
+	}
+
+	/*
+	 * Emulate the CPUID instruction via a hypercall. More info about
+	 * ABI can be found in TDX Guest-Host-Communication Interface
+	 * (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
+	 */
+	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+		return false;
+
+	/*
+	 * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
+	 * EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
+	 * So copy the register contents back to pt_regs.
+	 */
+	regs->ax = args.r12;
+	regs->bx = args.r13;
+	regs->cx = args.r14;
+	regs->dx = args.r15;
+
+	return true;
+}
+
 void tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -182,6 +223,18 @@ void tdx_get_ve_info(struct ve_info *ve)
 	ve->instr_info  = upper_32_bits(out.r10);
 }
 
+/* Handle the user initiated #VE */
+static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
+{
+	switch (ve->exit_reason) {
+	case EXIT_REASON_CPUID:
+		return handle_cpuid(regs);
+	default:
+		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+		return false;
+	}
+}
+
 /* Handle the kernel #VE */
 static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 {
@@ -192,6 +245,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 		return read_msr(regs);
 	case EXIT_REASON_MSR_WRITE:
 		return write_msr(regs);
+	case EXIT_REASON_CPUID:
+		return handle_cpuid(regs);
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return false;
@@ -203,7 +258,7 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
 	bool ret;
 
 	if (user_mode(regs))
-		ret = false;
+		ret = virt_exception_user(regs, ve);
 	else
 		ret = virt_exception_kernel(regs, ve);
 
-- 
2.34.1



* [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 10/30] x86/tdx: Handle CPUID via #VE Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-08 21:26   ` Dave Hansen
  2022-03-02 14:27 ` [PATCHv5 12/30] x86/tdx: Detect TDX at early kernel decompression time Kirill A. Shutemov
                   ` (18 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov

In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access, and then having the VMM emulate the
instruction that caused the VMEXIT. That's not possible for a TDX VM.

To emulate an instruction an emulator needs two things:

  - R/W access to the register file to read/modify instruction arguments
    and see RIP of the faulted instruction.

  - Read access to memory where instruction is placed to see what to
    emulate. In this case it is guest kernel text.

Both of them are not available to VMM in TDX environment:

  - Register file is never exposed to VMM. When a TD exits to the module,
    it saves registers into the state-save area allocated for that TD.
    The module then scrubs these registers before returning execution
    control to the VMM, to help prevent leakage of TD state.

  - Memory is encrypted with a TD-private key. The CPU disallows software
    other than the TDX module and TDs from making memory accesses using
    the private key.

In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.

Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
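
In hypercall terms, a 4-byte MMIO read of guest physical address gpa
boils down to (a simplified sketch of the mmio_read() helper added
below; hcall_func() and EPT_READ are defined in this series):

	struct tdx_hypercall_args args = {
		.r10 = TDX_HYPERCALL_STANDARD,
		.r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
		.r12 = 4,		/* access size in bytes */
		.r13 = EPT_READ,	/* direction */
		.r14 = gpa,		/* MMIO address */
	};

	if (!__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
		val = args.r11;		/* value read from the device */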

MMIO addresses can be used with any CPU instruction that accesses
memory, but by convention MMIO accesses are typically performed via
io.h helpers such as 'readl()' or 'writeq()'. Address only MMIO
accesses done via these helpers.

The io.h helpers intentionally use a limited set of instructions when
accessing MMIO.  This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.

MMIO accesses performed without the io.h helpers are at the mercy of
the compiler.  Compilers can and will generate a much broader set of
instructions which cannot practically be decoded and emulated.  TDX
guests will oops if they encounter one of these decoding failures.

This means that TDX guests *must* use the io.h helpers to access MMIO.

This requirement is not new.  Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
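
A sketch of the contrast (illustrative; a raw dereference may happen to
work, but nothing guarantees the compiler emits a decodable instruction):

	void __iomem *base = ioremap(phys_addr, PAGE_SIZE);
	u32 val;

	/* OK: readl() uses a known MOV form the #VE handler can decode */
	val = readl(base);

	/* Fragile: the compiler may pick any memory-access instruction */
	val = *(volatile u32 *)base;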

=== Potential alternative approaches ===

== Paravirtualizing all MMIO ==

An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.

Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.

However, a paravirtual approach would mean patching approximately 120k
call sites, replacing a bare memory access instruction with (at least)
a function call. With a conservative overhead estimate of 5 bytes per
call site (a CALL instruction), that bloats the code by roughly 600k.

Many drivers will never be used in the TDX environment and the bloat
cannot be justified.

== Patching TDX drivers ==

Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests.  Right now, that's
limited only to virtio and some x86-specific drivers.

All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.

This approach will be adopted in the future, removing the bulk of
MMIO #VEs. #VE-based MMIO will remain in place for non-virtio use cases.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/tdx.c | 114 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index d00b367f8052..e6163e7e3247 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -8,11 +8,17 @@
 #include <asm/coco.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_INFO			1
 #define TDX_GET_VEINFO			3
 
+/* MMIO direction */
+#define EPT_READ	0
+#define EPT_WRITE	1
+
 /*
  * Wrapper for standard use of __tdx_hypercall with no output aside from
  * return code.
@@ -200,6 +206,112 @@ static bool handle_cpuid(struct pt_regs *regs)
 	return true;
 }
 
+static bool mmio_read(int size, unsigned long addr, unsigned long *val)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
+		.r12 = size,
+		.r13 = EPT_READ,
+		.r14 = addr,
+		.r15 = *val,
+	};
+
+	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+		return false;
+	*val = args.r11;
+	return true;
+}
+
+static bool mmio_write(int size, unsigned long addr, unsigned long val)
+{
+	return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION), size,
+			       EPT_WRITE, addr, val);
+}
+
+static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+	char buffer[MAX_INSN_SIZE];
+	unsigned long *reg, val;
+	struct insn insn = {};
+	enum mmio_type mmio;
+	int size, extend_size;
+	u8 extend_val = 0;
+
+	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
+		return false;
+
+	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
+		return false;
+
+	mmio = insn_decode_mmio(&insn, &size);
+	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
+		return false;
+
+	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+		reg = insn_get_modrm_reg_ptr(&insn, regs);
+		if (!reg)
+			return false;
+	}
+
+	ve->instr_len = insn.length;
+
+	switch (mmio) {
+	case MMIO_WRITE:
+		memcpy(&val, reg, size);
+		return mmio_write(size, ve->gpa, val);
+	case MMIO_WRITE_IMM:
+		val = insn.immediate.value;
+		return mmio_write(size, ve->gpa, val);
+	case MMIO_READ:
+	case MMIO_READ_ZERO_EXTEND:
+	case MMIO_READ_SIGN_EXTEND:
+		break;
+	case MMIO_MOVS:
+	case MMIO_DECODE_FAILED:
+		/*
+		 * MMIO was accessed with an instruction that could not be
+		 * decoded or handled properly. It was likely not using io.h
+		 * helpers or accessed MMIO accidentally.
+		 */
+		return false;
+	default:
+		/* Unknown insn_decode_mmio() decode value? */
+		BUG();
+	}
+
+	/* Handle reads */
+	if (!mmio_read(size, ve->gpa, &val))
+		return false;
+
+	switch (mmio) {
+	case MMIO_READ:
+		/* Zero-extend for 32-bit operation */
+		extend_size = size == 4 ? sizeof(*reg) : 0;
+		break;
+	case MMIO_READ_ZERO_EXTEND:
+		/* Zero extend based on operand size */
+		extend_size = insn.opnd_bytes;
+		break;
+	case MMIO_READ_SIGN_EXTEND:
+		/* Sign extend based on operand size */
+		extend_size = insn.opnd_bytes;
+		if (size == 1 && val & BIT(7))
+			extend_val = 0xFF;
+		else if (size > 1 && val & BIT(15))
+			extend_val = 0xFF;
+		break;
+	default:
+		/* All other cases has to be covered with the first switch() */
+		BUG();
+	}
+
+	if (extend_size)
+		memset(reg, extend_val, extend_size);
+	memcpy(reg, &val, size);
+	return true;
+}
+
 void tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -247,6 +359,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 		return write_msr(regs);
 	case EXIT_REASON_CPUID:
 		return handle_cpuid(regs);
+	case EXIT_REASON_EPT_VIOLATION:
+		return handle_mmio(regs, ve);
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return false;
-- 
2.34.1



* [PATCHv5 12/30] x86/tdx: Detect TDX at early kernel decompression time
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-07 22:27   ` [PATCHv5.1 " Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 13/30] x86: Adjust types used in port I/O helpers Kirill A. Shutemov
                   ` (17 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A . Shutemov,
	Dave Hansen

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

The early decompression code does port I/O for its console output. But,
handling the decompression-time port I/O demands a different approach
from normal runtime because the IDT required to support #VE-based port
I/O emulation is not yet set up. Paravirtualizing I/O calls during
the decompression step is acceptable because the decompression code
doesn't have many call sites that use I/O instructions.

To support port I/O in decompression code, TDX must be detected before
the decompression code might do port I/O. Detect whether the kernel runs
in a TDX guest.

Add an early_is_tdx_guest() interface to query the cached TDX guest
status in the decompression code.

TDX is detected with CPUID. Make cpuid_count() accessible outside
boot/cpuflags.c.

TDX detection in the main kernel is very similar. Move common bits
into <asm/shared/tdx.h>.

The actual port I/O paravirtualization will come later in the series.
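
A sketch of how the cached status gets consumed once that lands (the
init_tdx_io_ops() name is hypothetical; this patch only adds
early_is_tdx_guest() itself):

	/* In the boot stub, once alternative port I/O hooks exist: */
	if (early_is_tdx_guest())
		init_tdx_io_ops();	/* route port I/O through TDVMCALLs */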

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile |  1 +
 arch/x86/boot/compressed/misc.c   |  8 ++++++++
 arch/x86/boot/compressed/misc.h   |  2 ++
 arch/x86/boot/compressed/tdx.c    | 27 +++++++++++++++++++++++++++
 arch/x86/boot/compressed/tdx.h    | 15 +++++++++++++++
 arch/x86/boot/cpuflags.c          |  3 +--
 arch/x86/boot/cpuflags.h          |  1 +
 arch/x86/include/asm/shared/tdx.h |  8 ++++++++
 arch/x86/include/asm/tdx.h        |  4 +---
 9 files changed, 64 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdx.c
 create mode 100644 arch/x86/boot/compressed/tdx.h
 create mode 100644 arch/x86/include/asm/shared/tdx.h

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6115274fe10f..732f6b21ecbd 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -101,6 +101,7 @@ ifdef CONFIG_X86_64
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index a4339cb2d247..2b1169869b96 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -370,6 +370,14 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 	lines = boot_params->screen_info.orig_video_lines;
 	cols = boot_params->screen_info.orig_video_cols;
 
+	/*
+	 * Detect TDX guest environment.
+	 *
+	 * It has to be done before console_init() in order to use
+	 * paravirtualized port I/O operations if needed.
+	 */
+	early_tdx_detect();
+
 	console_init();
 
 	/*
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 16ed360b6692..0d8e275a9d96 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -28,6 +28,8 @@
 #include <asm/bootparam.h>
 #include <asm/desc_defs.h>
 
+#include "tdx.h"
+
 #define BOOT_CTYPE_H
 #include <linux/acpi.h>
 
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
new file mode 100644
index 000000000000..dec68c184358
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.c
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include "../cpuflags.h"
+#include "../string.h"
+
+#include <asm/shared/tdx.h>
+
+static bool tdx_guest_detected;
+
+bool early_is_tdx_guest(void)
+{
+	return tdx_guest_detected;
+}
+
+void early_tdx_detect(void)
+{
+	u32 eax, sig[3];
+
+	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
+
+	BUILD_BUG_ON(sizeof(sig) != sizeof(TDX_IDENT) - 1);
+	if (memcmp(TDX_IDENT, sig, sizeof(sig)))
+		return;
+
+	/* Cache TDX guest feature status */
+	tdx_guest_detected = true;
+}
diff --git a/arch/x86/boot/compressed/tdx.h b/arch/x86/boot/compressed/tdx.h
new file mode 100644
index 000000000000..a7bff6ae002e
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_COMPRESSED_TDX_H
+#define BOOT_COMPRESSED_TDX_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+void early_tdx_detect(void);
+bool early_is_tdx_guest(void);
+#else
+static inline void early_tdx_detect(void) { };
+static inline bool early_is_tdx_guest(void) { return false; }
+#endif
+
+#endif /* BOOT_COMPRESSED_TDX_H */
diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
index a0b75f73dc63..a83d67ec627d 100644
--- a/arch/x86/boot/cpuflags.c
+++ b/arch/x86/boot/cpuflags.c
@@ -71,8 +71,7 @@ int has_eflag(unsigned long mask)
 # define EBX_REG "=b"
 #endif
 
-static inline void cpuid_count(u32 id, u32 count,
-		u32 *a, u32 *b, u32 *c, u32 *d)
+void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d)
 {
 	asm volatile(".ifnc %%ebx,%3 ; movl  %%ebx,%3 ; .endif	\n\t"
 		     "cpuid					\n\t"
diff --git a/arch/x86/boot/cpuflags.h b/arch/x86/boot/cpuflags.h
index 2e20814d3ce3..475b8fde90f7 100644
--- a/arch/x86/boot/cpuflags.h
+++ b/arch/x86/boot/cpuflags.h
@@ -17,5 +17,6 @@ extern u32 cpu_vendor[3];
 
 int has_eflag(unsigned long mask);
 void get_cpuflags(void);
+void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d);
 
 #endif
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
new file mode 100644
index 000000000000..8209ba9ffe1a
--- /dev/null
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SHARED_TDX_H
+#define _ASM_X86_SHARED_TDX_H
+
+#define TDX_CPUID_LEAF_ID	0x21
+#define TDX_IDENT		"IntelTDX    "
+
+#endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 1f150e7a2f8f..76cffbda0e79 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -6,9 +6,7 @@
 #include <linux/bits.h>
 #include <linux/init.h>
 #include <asm/ptrace.h>
-
-#define TDX_CPUID_LEAF_ID	0x21
-#define TDX_IDENT		"IntelTDX    "
+#include <asm/shared/tdx.h>
 
 #define TDX_HYPERCALL_STANDARD  0
 
-- 
2.34.1



* [PATCHv5 13/30] x86: Adjust types used in port I/O helpers
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 12/30] x86/tdx: Detect TDX at early kernel decompression time Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 14/30] x86: Consolidate " Kirill A. Shutemov
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov,
	Dave Hansen

Change port I/O helpers to use u8/u16/u32 instead of unsigned
char/short/int for values. Use u16 instead of int for port number.

It aligns the helpers with the implementation in the boot stub in
preparation for consolidation.
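
For example, after this change the expanded byte helpers effectively
read (a sketch of the BUILDIO() expansion):

	static inline void outb(u8 value, u16 port);
	static inline u8 inb(u16 port);

instead of:

	static inline void outb(unsigned char value, int port);
	static inline unsigned char inb(int port);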

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/include/asm/io.h | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index f6d91ecb8026..638c1a2a82e0 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -258,37 +258,37 @@ static inline void slow_down_io(void)
 #endif
 
 #define BUILDIO(bwl, bw, type)						\
-static inline void out##bwl(unsigned type value, int port)		\
+static inline void out##bwl(type value, u16 port)			\
 {									\
 	asm volatile("out" #bwl " %" #bw "0, %w1"			\
 		     : : "a"(value), "Nd"(port));			\
 }									\
 									\
-static inline unsigned type in##bwl(int port)				\
+static inline type in##bwl(u16 port)					\
 {									\
-	unsigned type value;						\
+	type value;							\
 	asm volatile("in" #bwl " %w1, %" #bw "0"			\
 		     : "=a"(value) : "Nd"(port));			\
 	return value;							\
 }									\
 									\
-static inline void out##bwl##_p(unsigned type value, int port)		\
+static inline void out##bwl##_p(type value, u16 port)			\
 {									\
 	out##bwl(value, port);						\
 	slow_down_io();							\
 }									\
 									\
-static inline unsigned type in##bwl##_p(int port)			\
+static inline type in##bwl##_p(u16 port)				\
 {									\
-	unsigned type value = in##bwl(port);				\
+	type value = in##bwl(port);					\
 	slow_down_io();							\
 	return value;							\
 }									\
 									\
-static inline void outs##bwl(int port, const void *addr, unsigned long count) \
+static inline void outs##bwl(u16 port, const void *addr, unsigned long count) \
 {									\
 	if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) {		\
-		unsigned type *value = (unsigned type *)addr;		\
+		type *value = (type *)addr;				\
 		while (count) {						\
 			out##bwl(*value, port);				\
 			value++;					\
@@ -301,10 +301,10 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 	}								\
 }									\
 									\
-static inline void ins##bwl(int port, void *addr, unsigned long count)	\
+static inline void ins##bwl(u16 port, void *addr, unsigned long count)	\
 {									\
 	if (cc_platform_has(CC_ATTR_GUEST_UNROLL_STRING_IO)) {		\
-		unsigned type *value = (unsigned type *)addr;		\
+		type *value = (type *)addr;				\
 		while (count) {						\
 			*value = in##bwl(port);				\
 			value++;					\
@@ -317,9 +317,9 @@ static inline void ins##bwl(int port, void *addr, unsigned long count)	\
 	}								\
 }
 
-BUILDIO(b, b, char)
-BUILDIO(w, w, short)
-BUILDIO(l, , int)
+BUILDIO(b, b, u8)
+BUILDIO(w, w, u16)
+BUILDIO(l,  , u32)
 
 #define inb inb
 #define inw inw
-- 
2.34.1



* [PATCHv5 14/30] x86: Consolidate port I/O helpers
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 13/30] x86: Adjust types used in port I/O helpers Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 15/30] x86/boot: Port I/O: allow to hook up alternative helpers Kirill A. Shutemov
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov

There are two implementations of port I/O helpers: one in the kernel and
one in the boot stub.

Move the helpers required for both to <asm/shared/io.h> and use the one
implementation everywhere.
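
For reference, BUILDIO(b, b, u8) in the shared header expands to
roughly:

	static inline void outb(u8 value, u16 port)
	{
		asm volatile("outb %b0, %w1" : : "a"(value), "Nd"(port));
	}

	static inline u8 inb(u16 port)
	{
		u8 value;

		asm volatile("inb %w1, %b0" : "=a"(value) : "Nd"(port));
		return value;
	}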

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/boot/boot.h             | 35 +-------------------------------
 arch/x86/boot/compressed/misc.h  |  2 +-
 arch/x86/include/asm/io.h        | 22 ++------------------
 arch/x86/include/asm/shared/io.h | 34 +++++++++++++++++++++++++++++++
 4 files changed, 38 insertions(+), 55 deletions(-)
 create mode 100644 arch/x86/include/asm/shared/io.h

diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 34c9dbb6a47d..22a474c5b3e8 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -23,6 +23,7 @@
 #include <linux/edd.h>
 #include <asm/setup.h>
 #include <asm/asm.h>
+#include <asm/shared/io.h>
 #include "bitops.h"
 #include "ctype.h"
 #include "cpuflags.h"
@@ -35,40 +36,6 @@ extern struct boot_params boot_params;
 
 #define cpu_relax()	asm volatile("rep; nop")
 
-/* Basic port I/O */
-static inline void outb(u8 v, u16 port)
-{
-	asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u8 inb(u16 port)
-{
-	u8 v;
-	asm volatile("inb %1,%0" : "=a" (v) : "dN" (port));
-	return v;
-}
-
-static inline void outw(u16 v, u16 port)
-{
-	asm volatile("outw %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u16 inw(u16 port)
-{
-	u16 v;
-	asm volatile("inw %1,%0" : "=a" (v) : "dN" (port));
-	return v;
-}
-
-static inline void outl(u32 v, u16 port)
-{
-	asm volatile("outl %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u32 inl(u16 port)
-{
-	u32 v;
-	asm volatile("inl %1,%0" : "=a" (v) : "dN" (port));
-	return v;
-}
-
 static inline void io_delay(void)
 {
 	const u16 DELAY_PORT = 0x80;
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 0d8e275a9d96..8a253e85f990 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -22,11 +22,11 @@
 #include <linux/linkage.h>
 #include <linux/screen_info.h>
 #include <linux/elf.h>
-#include <linux/io.h>
 #include <asm/page.h>
 #include <asm/boot.h>
 #include <asm/bootparam.h>
 #include <asm/desc_defs.h>
+#include <asm/shared/io.h>
 
 #include "tdx.h"
 
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 638c1a2a82e0..a1eb218a49f8 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -44,6 +44,7 @@
 #include <asm/page.h>
 #include <asm/early_ioremap.h>
 #include <asm/pgtable_types.h>
+#include <asm/shared/io.h>
 
 #define build_mmio_read(name, size, type, reg, barrier) \
 static inline type name(const volatile void __iomem *addr) \
@@ -258,20 +259,6 @@ static inline void slow_down_io(void)
 #endif
 
 #define BUILDIO(bwl, bw, type)						\
-static inline void out##bwl(type value, u16 port)			\
-{									\
-	asm volatile("out" #bwl " %" #bw "0, %w1"			\
-		     : : "a"(value), "Nd"(port));			\
-}									\
-									\
-static inline type in##bwl(u16 port)					\
-{									\
-	type value;							\
-	asm volatile("in" #bwl " %w1, %" #bw "0"			\
-		     : "=a"(value) : "Nd"(port));			\
-	return value;							\
-}									\
-									\
 static inline void out##bwl##_p(type value, u16 port)			\
 {									\
 	out##bwl(value, port);						\
@@ -320,10 +307,8 @@ static inline void ins##bwl(u16 port, void *addr, unsigned long count)	\
 BUILDIO(b, b, u8)
 BUILDIO(w, w, u16)
 BUILDIO(l,  , u32)
+#undef BUILDIO
 
-#define inb inb
-#define inw inw
-#define inl inl
 #define inb_p inb_p
 #define inw_p inw_p
 #define inl_p inl_p
@@ -331,9 +316,6 @@ BUILDIO(l,  , u32)
 #define insw insw
 #define insl insl
 
-#define outb outb
-#define outw outw
-#define outl outl
 #define outb_p outb_p
 #define outw_p outw_p
 #define outl_p outl_p
diff --git a/arch/x86/include/asm/shared/io.h b/arch/x86/include/asm/shared/io.h
new file mode 100644
index 000000000000..6707cd555f0c
--- /dev/null
+++ b/arch/x86/include/asm/shared/io.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SHARED_IO_H
+#define _ASM_X86_SHARED_IO_H
+
+#include <linux/types.h>
+
+#define BUILDIO(bwl, bw, type)						\
+static inline void out##bwl(type value, u16 port)			\
+{									\
+	asm volatile("out" #bwl " %" #bw "0, %w1"			\
+		     : : "a"(value), "Nd"(port));			\
+}									\
+									\
+static inline type in##bwl(u16 port)					\
+{									\
+	type value;							\
+	asm volatile("in" #bwl " %w1, %" #bw "0"			\
+		     : "=a"(value) : "Nd"(port));			\
+	return value;							\
+}
+
+BUILDIO(b, b, u8)
+BUILDIO(w, w, u16)
+BUILDIO(l,  , u32)
+#undef BUILDIO
+
+#define inb inb
+#define inw inw
+#define inl inl
+#define outb outb
+#define outw outw
+#define outl outl
+
+#endif
-- 
2.34.1



* [PATCHv5 15/30] x86/boot: Port I/O: allow to hook up alternative helpers
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (13 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 14/30] x86: Consolidate " Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-02 17:42   ` Josh Poimboeuf
  2022-03-02 14:27 ` [PATCHv5 16/30] x86/boot: Port I/O: add decompression-time support for TDX Kirill A. Shutemov
                   ` (14 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov,
	Dave Hansen

Port I/O instructions trigger #VE in the TDX environment. In response to
the exception, the kernel emulates these instructions using hypercalls.

But during early boot, on the decompression stage, it is cumbersome to
deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
handling.

Add a way to hook up alternative port I/O helpers in the boot stub with
a new pio_ops structure.  For now, set the ops structure to just call
the normal I/O operation functions.

The approach has a downside: TDX boot will fail if any code bypasses
pio_ops and uses a direct port I/O helper. The failure will only be
visible on TDX boot (or for any other user of alternative pio_ops).
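
A sketch of that failure mode (the port number is illustrative):

	u8 status;

	status = pio_ops.inb(0x64);	/* hooked: works on TDX too */
	status = inb(0x64);		/* bypasses the hook: the raw IN
					 * raises an unhandled #VE on TDX
					 * and the boot dies */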

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/boot/a20.c                  | 14 ++++++------
 arch/x86/boot/boot.h                 |  2 +-
 arch/x86/boot/compressed/misc.c      | 18 ++++++++++------
 arch/x86/boot/compressed/misc.h      |  2 +-
 arch/x86/boot/early_serial_console.c | 28 ++++++++++++------------
 arch/x86/boot/io.h                   | 32 ++++++++++++++++++++++++++++
 arch/x86/boot/main.c                 |  4 ++++
 arch/x86/boot/pm.c                   | 10 ++++-----
 arch/x86/boot/tty.c                  |  4 ++--
 arch/x86/boot/video-vga.c            |  6 +++---
 arch/x86/boot/video.h                |  8 ++++---
 arch/x86/realmode/rm/wakemain.c      | 14 +++++++-----
 12 files changed, 95 insertions(+), 47 deletions(-)
 create mode 100644 arch/x86/boot/io.h

diff --git a/arch/x86/boot/a20.c b/arch/x86/boot/a20.c
index a2b6b428922a..7f6dd5cc4670 100644
--- a/arch/x86/boot/a20.c
+++ b/arch/x86/boot/a20.c
@@ -25,7 +25,7 @@ static int empty_8042(void)
 	while (loops--) {
 		io_delay();
 
-		status = inb(0x64);
+		status = pio_ops.inb(0x64);
 		if (status == 0xff) {
 			/* FF is a plausible, but very unlikely status */
 			if (!--ffs)
@@ -34,7 +34,7 @@ static int empty_8042(void)
 		if (status & 1) {
 			/* Read and discard input data */
 			io_delay();
-			(void)inb(0x60);
+			(void)pio_ops.inb(0x60);
 		} else if (!(status & 2)) {
 			/* Buffers empty, finished! */
 			return 0;
@@ -99,13 +99,13 @@ static void enable_a20_kbc(void)
 {
 	empty_8042();
 
-	outb(0xd1, 0x64);	/* Command write */
+	pio_ops.outb(0xd1, 0x64);	/* Command write */
 	empty_8042();
 
-	outb(0xdf, 0x60);	/* A20 on */
+	pio_ops.outb(0xdf, 0x60);	/* A20 on */
 	empty_8042();
 
-	outb(0xff, 0x64);	/* Null command, but UHCI wants it */
+	pio_ops.outb(0xff, 0x64);	/* Null command, but UHCI wants it */
 	empty_8042();
 }
 
@@ -113,10 +113,10 @@ static void enable_a20_fast(void)
 {
 	u8 port_a;
 
-	port_a = inb(0x92);	/* Configuration port A */
+	port_a = pio_ops.inb(0x92);	/* Configuration port A */
 	port_a |=  0x02;	/* Enable A20 */
 	port_a &= ~0x01;	/* Do not reset machine */
-	outb(port_a, 0x92);
+	pio_ops.outb(port_a, 0x92);
 }
 
 /*
diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 22a474c5b3e8..bd8f640ca15f 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -23,10 +23,10 @@
 #include <linux/edd.h>
 #include <asm/setup.h>
 #include <asm/asm.h>
-#include <asm/shared/io.h>
 #include "bitops.h"
 #include "ctype.h"
 #include "cpuflags.h"
+#include "io.h"
 
 /* Useful macros */
 #define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 2b1169869b96..ff0e1b977514 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -47,6 +47,8 @@ void *memmove(void *dest, const void *src, size_t n);
  */
 struct boot_params *boot_params;
 
+struct port_io_ops pio_ops;
+
 memptr free_mem_ptr;
 memptr free_mem_end_ptr;
 
@@ -103,10 +105,12 @@ static void serial_putchar(int ch)
 {
 	unsigned timeout = 0xffff;
 
-	while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+	while ((pio_ops.inb(early_serial_base + LSR) & XMTRDY) == 0 &&
+	       --timeout) {
 		cpu_relax();
+	}
 
-	outb(ch, early_serial_base + TXR);
+	pio_ops.outb(ch, early_serial_base + TXR);
 }
 
 void __putstr(const char *s)
@@ -152,10 +156,10 @@ void __putstr(const char *s)
 	boot_params->screen_info.orig_y = y;
 
 	pos = (x + cols * y) * 2;	/* Update cursor position */
-	outb(14, vidport);
-	outb(0xff & (pos >> 9), vidport+1);
-	outb(15, vidport);
-	outb(0xff & (pos >> 1), vidport+1);
+	pio_ops.outb(14, vidport);
+	pio_ops.outb(0xff & (pos >> 9), vidport+1);
+	pio_ops.outb(15, vidport);
+	pio_ops.outb(0xff & (pos >> 1), vidport+1);
 }
 
 void __puthex(unsigned long value)
@@ -370,6 +374,8 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 	lines = boot_params->screen_info.orig_video_lines;
 	cols = boot_params->screen_info.orig_video_cols;
 
+	init_default_io_ops();
+
 	/*
 	 * Detect TDX guest environment.
 	 *
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 8a253e85f990..ea71cf3d64e1 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -26,7 +26,6 @@
 #include <asm/boot.h>
 #include <asm/bootparam.h>
 #include <asm/desc_defs.h>
-#include <asm/shared/io.h>
 
 #include "tdx.h"
 
@@ -35,6 +34,7 @@
 
 #define BOOT_BOOT_H
 #include "../ctype.h"
+#include "../io.h"
 
 #ifdef CONFIG_X86_64
 #define memptr long
diff --git a/arch/x86/boot/early_serial_console.c b/arch/x86/boot/early_serial_console.c
index 023bf1c3de8b..03e43d770571 100644
--- a/arch/x86/boot/early_serial_console.c
+++ b/arch/x86/boot/early_serial_console.c
@@ -28,17 +28,17 @@ static void early_serial_init(int port, int baud)
 	unsigned char c;
 	unsigned divisor;
 
-	outb(0x3, port + LCR);	/* 8n1 */
-	outb(0, port + IER);	/* no interrupt */
-	outb(0, port + FCR);	/* no fifo */
-	outb(0x3, port + MCR);	/* DTR + RTS */
+	pio_ops.outb(0x3, port + LCR);	/* 8n1 */
+	pio_ops.outb(0, port + IER);	/* no interrupt */
+	pio_ops.outb(0, port + FCR);	/* no fifo */
+	pio_ops.outb(0x3, port + MCR);	/* DTR + RTS */
 
 	divisor	= 115200 / baud;
-	c = inb(port + LCR);
-	outb(c | DLAB, port + LCR);
-	outb(divisor & 0xff, port + DLL);
-	outb((divisor >> 8) & 0xff, port + DLH);
-	outb(c & ~DLAB, port + LCR);
+	c = pio_ops.inb(port + LCR);
+	pio_ops.outb(c | DLAB, port + LCR);
+	pio_ops.outb(divisor & 0xff, port + DLL);
+	pio_ops.outb((divisor >> 8) & 0xff, port + DLH);
+	pio_ops.outb(c & ~DLAB, port + LCR);
 
 	early_serial_base = port;
 }
@@ -104,11 +104,11 @@ static unsigned int probe_baud(int port)
 	unsigned char lcr, dll, dlh;
 	unsigned int quot;
 
-	lcr = inb(port + LCR);
-	outb(lcr | DLAB, port + LCR);
-	dll = inb(port + DLL);
-	dlh = inb(port + DLH);
-	outb(lcr, port + LCR);
+	lcr = pio_ops.inb(port + LCR);
+	pio_ops.outb(lcr | DLAB, port + LCR);
+	dll = pio_ops.inb(port + DLL);
+	dlh = pio_ops.inb(port + DLH);
+	pio_ops.outb(lcr, port + LCR);
 	quot = (dlh << 8) | dll;
 
 	return BASE_BAUD / quot;
diff --git a/arch/x86/boot/io.h b/arch/x86/boot/io.h
new file mode 100644
index 000000000000..87dc8ee5d15f
--- /dev/null
+++ b/arch/x86/boot/io.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_IO_H
+#define BOOT_IO_H
+
+#include <asm/shared/io.h>
+
+struct port_io_ops {
+	u8 (*inb)(u16 port);
+	u16 (*inw)(u16 port);
+	u32 (*inl)(u16 port);
+	void (*outb)(u8 v, u16 port);
+	void (*outw)(u16 v, u16 port);
+	void (*outl)(u32 v, u16 port);
+};
+
+extern struct port_io_ops pio_ops;
+
+/*
+ * Use the normal I/O instructions by default.
+ * TDX guests override these to use hypercalls.
+ */
+static inline void init_default_io_ops(void)
+{
+	pio_ops.inb = inb;
+	pio_ops.inw = inw;
+	pio_ops.inl = inl;
+	pio_ops.outb = outb;
+	pio_ops.outw = outw;
+	pio_ops.outl = outl;
+}
+
+#endif
diff --git a/arch/x86/boot/main.c b/arch/x86/boot/main.c
index e3add857c2c9..1202d4f8a390 100644
--- a/arch/x86/boot/main.c
+++ b/arch/x86/boot/main.c
@@ -17,6 +17,8 @@
 
 struct boot_params boot_params __attribute__((aligned(16)));
 
+struct port_io_ops pio_ops;
+
 char *HEAP = _end;
 char *heap_end = _end;		/* Default end of heap = no heap */
 
@@ -133,6 +135,8 @@ static void init_heap(void)
 
 void main(void)
 {
+	init_default_io_ops();
+
 	/* First, copy the boot header into the "zeropage" */
 	copy_boot_params();
 
diff --git a/arch/x86/boot/pm.c b/arch/x86/boot/pm.c
index 40031a614712..4180b6a264c9 100644
--- a/arch/x86/boot/pm.c
+++ b/arch/x86/boot/pm.c
@@ -25,7 +25,7 @@ static void realmode_switch_hook(void)
 			     : "eax", "ebx", "ecx", "edx");
 	} else {
 		asm volatile("cli");
-		outb(0x80, 0x70); /* Disable NMI */
+		pio_ops.outb(0x80, 0x70); /* Disable NMI */
 		io_delay();
 	}
 }
@@ -35,9 +35,9 @@ static void realmode_switch_hook(void)
  */
 static void mask_all_interrupts(void)
 {
-	outb(0xff, 0xa1);	/* Mask all interrupts on the secondary PIC */
+	pio_ops.outb(0xff, 0xa1);	/* Mask all interrupts on the secondary PIC */
 	io_delay();
-	outb(0xfb, 0x21);	/* Mask all but cascade on the primary PIC */
+	pio_ops.outb(0xfb, 0x21);	/* Mask all but cascade on the primary PIC */
 	io_delay();
 }
 
@@ -46,9 +46,9 @@ static void mask_all_interrupts(void)
  */
 static void reset_coprocessor(void)
 {
-	outb(0, 0xf0);
+	pio_ops.outb(0, 0xf0);
 	io_delay();
-	outb(0, 0xf1);
+	pio_ops.outb(0, 0xf1);
 	io_delay();
 }
 
diff --git a/arch/x86/boot/tty.c b/arch/x86/boot/tty.c
index f7eb976b0a4b..ee8700682801 100644
--- a/arch/x86/boot/tty.c
+++ b/arch/x86/boot/tty.c
@@ -29,10 +29,10 @@ static void __section(".inittext") serial_putchar(int ch)
 {
 	unsigned timeout = 0xffff;
 
-	while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+	while ((pio_ops.inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
 		cpu_relax();
 
-	outb(ch, early_serial_base + TXR);
+	pio_ops.outb(ch, early_serial_base + TXR);
 }
 
 static void __section(".inittext") bios_putchar(int ch)
diff --git a/arch/x86/boot/video-vga.c b/arch/x86/boot/video-vga.c
index 4816cb9cf996..17baac542ee7 100644
--- a/arch/x86/boot/video-vga.c
+++ b/arch/x86/boot/video-vga.c
@@ -131,7 +131,7 @@ static void vga_set_80x43(void)
 /* I/O address of the VGA CRTC */
 u16 vga_crtc(void)
 {
-	return (inb(0x3cc) & 1) ? 0x3d4 : 0x3b4;
+	return (pio_ops.inb(0x3cc) & 1) ? 0x3d4 : 0x3b4;
 }
 
 static void vga_set_480_scanlines(void)
@@ -148,10 +148,10 @@ static void vga_set_480_scanlines(void)
 	out_idx(0xdf, crtc, 0x12); /* Vertical display end */
 	out_idx(0xe7, crtc, 0x15); /* Vertical blank start */
 	out_idx(0x04, crtc, 0x16); /* Vertical blank end */
-	csel = inb(0x3cc);
+	csel = pio_ops.inb(0x3cc);
 	csel &= 0x0d;
 	csel |= 0xe2;
-	outb(csel, 0x3c2);
+	pio_ops.outb(csel, 0x3c2);
 }
 
 static void vga_set_vertical_end(int lines)
diff --git a/arch/x86/boot/video.h b/arch/x86/boot/video.h
index 04bde0bb2003..87a5f726e731 100644
--- a/arch/x86/boot/video.h
+++ b/arch/x86/boot/video.h
@@ -15,6 +15,8 @@
 
 #include <linux/types.h>
 
+#include "boot.h"
+
 /*
  * This code uses an extended set of video mode numbers. These include:
  * Aliases for standard modes
@@ -96,13 +98,13 @@ extern int graphic_mode;	/* Graphics mode with linear frame buffer */
 /* Accessing VGA indexed registers */
 static inline u8 in_idx(u16 port, u8 index)
 {
-	outb(index, port);
-	return inb(port+1);
+	pio_ops.outb(index, port);
+	return pio_ops.inb(port+1);
 }
 
 static inline void out_idx(u8 v, u16 port, u8 index)
 {
-	outw(index+(v << 8), port);
+	pio_ops.outw(index+(v << 8), port);
 }
 
 /* Writes a value to an indexed port and then reads the port again */
diff --git a/arch/x86/realmode/rm/wakemain.c b/arch/x86/realmode/rm/wakemain.c
index 1d6437e6d2ba..8c2eb2a829f1 100644
--- a/arch/x86/realmode/rm/wakemain.c
+++ b/arch/x86/realmode/rm/wakemain.c
@@ -17,18 +17,18 @@ static void beep(unsigned int hz)
 	} else {
 		u16 div = 1193181/hz;
 
-		outb(0xb6, 0x43);	/* Ctr 2, squarewave, load, binary */
+		pio_ops.outb(0xb6, 0x43);	/* Ctr 2, squarewave, load, binary */
 		io_delay();
-		outb(div, 0x42);	/* LSB of counter */
+		pio_ops.outb(div, 0x42);	/* LSB of counter */
 		io_delay();
-		outb(div >> 8, 0x42);	/* MSB of counter */
+		pio_ops.outb(div >> 8, 0x42);	/* MSB of counter */
 		io_delay();
 
 		enable = 0x03;		/* Turn on speaker */
 	}
-	inb(0x61);		/* Dummy read of System Control Port B */
+	pio_ops.inb(0x61);		/* Dummy read of System Control Port B */
 	io_delay();
-	outb(enable, 0x61);	/* Enable timer 2 output to speaker */
+	pio_ops.outb(enable, 0x61);	/* Enable timer 2 output to speaker */
 	io_delay();
 }
 
@@ -62,8 +62,12 @@ static void send_morse(const char *pattern)
 	}
 }
 
+struct port_io_ops pio_ops;
+
 void main(void)
 {
+	init_default_io_ops();
+
 	/* Kill machine if structures are wrong */
 	if (wakeup_header.real_magic != 0x12345678)
 		while (1)
-- 
2.34.1



* [PATCHv5 16/30] x86/boot: Port I/O: add decompression-time support for TDX
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (14 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 15/30] x86/boot: Port I/O: allow to hook up alternative helpers Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 17/30] x86/tdx: Port I/O: add runtime hypercalls Kirill A. Shutemov
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov,
	Dave Hansen

Port I/O instructions trigger #VE in the TDX environment. In response to
the exception, the kernel emulates these instructions using hypercalls.

But during early boot, at the decompression stage, it is cumbersome to
deal with #VE. It is cleaner to issue hypercalls directly, bypassing #VE
handling.

Hook up TDX-specific port I/O helpers if booting in a TDX environment.
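
For illustration only (not part of this patch), here is the path a
one-byte port read takes in the decompressor once the TDX helpers are
hooked up; read_serial_lsr() is a made-up example caller:

  /* pio_ops.inb is tdx_inb() on TDX, plain inb() elsewhere */
  static u8 read_serial_lsr(void)
  {
  	/* COM1 Line Status Register read; goes out as
  	 * TDG.VP.VMCALL<Instruction.IO> on TDX */
  	return pio_ops.inb(0x3f8 + 5);
  }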

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile |  2 +-
 arch/x86/boot/compressed/tdcall.S |  3 ++
 arch/x86/boot/compressed/tdx.c    | 72 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/shared/tdx.h | 29 +++++++++++++
 arch/x86/include/asm/tdx.h        | 24 -----------
 5 files changed, 105 insertions(+), 25 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdcall.S

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 732f6b21ecbd..8fd0e6ae2e1f 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -101,7 +101,7 @@ ifdef CONFIG_X86_64
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
-vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
new file mode 100644
index 000000000000..59b80ab6b41c
--- /dev/null
+++ b/arch/x86/boot/compressed/tdcall.S
@@ -0,0 +1,3 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include "../../coco/tdcall.S"
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index dec68c184358..0d88339dcc41 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -2,6 +2,10 @@
 
 #include "../cpuflags.h"
 #include "../string.h"
+#include "../io.h"
+
+#include <vdso/limits.h>
+#include <uapi/asm/vmx.h>
 
 #include <asm/shared/tdx.h>
 
@@ -12,6 +16,66 @@ bool early_is_tdx_guest(void)
 	return tdx_guest_detected;
 }
 
+static inline unsigned int tdx_io_in(int size, u16 port)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = EXIT_REASON_IO_INSTRUCTION,
+		.r12 = size,
+		.r13 = 0,
+		.r14 = port,
+	};
+
+	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+		return UINT_MAX;
+
+	return args.r11;
+}
+
+static inline void tdx_io_out(int size, u16 port, u32 value)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = EXIT_REASON_IO_INSTRUCTION,
+		.r12 = size,
+		.r13 = 1,
+		.r14 = port,
+		.r15 = value,
+	};
+
+	__tdx_hypercall(&args, 0);
+}
+
+static inline u8 tdx_inb(u16 port)
+{
+	return tdx_io_in(1, port);
+}
+
+static inline u16 tdx_inw(u16 port)
+{
+	return tdx_io_in(2, port);
+}
+
+static inline u32 tdx_inl(u16 port)
+{
+	return tdx_io_in(4, port);
+}
+
+static inline void tdx_outb(u8 value, u16 port)
+{
+	tdx_io_out(1, port, value);
+}
+
+static inline void tdx_outw(u16 value, u16 port)
+{
+	tdx_io_out(2, port, value);
+}
+
+static inline void tdx_outl(u32 value, u16 port)
+{
+	tdx_io_out(4, port, value);
+}
+
 void early_tdx_detect(void)
 {
 	u32 eax, sig[3];
@@ -24,4 +88,12 @@ void early_tdx_detect(void)
 
 	/* Cache TDX guest feature status */
 	tdx_guest_detected = true;
+
+	/* Use hypercalls instead of I/O instructions */
+	pio_ops.inb = tdx_inb;
+	pio_ops.inw = tdx_inw;
+	pio_ops.inl = tdx_inl;
+	pio_ops.outb = tdx_outb;
+	pio_ops.outw = tdx_outw;
+	pio_ops.outl = tdx_outl;
 }
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 8209ba9ffe1a..51bce6351124 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -2,7 +2,36 @@
 #ifndef _ASM_X86_SHARED_TDX_H
 #define _ASM_X86_SHARED_TDX_H
 
+#include <linux/bits.h>
+#include <linux/types.h>
+
+#define TDX_HYPERCALL_STANDARD  0
+
+#define TDX_HCALL_HAS_OUTPUT	BIT(0)
+#define TDX_HCALL_ISSUE_STI	BIT(1)
+
 #define TDX_CPUID_LEAF_ID	0x21
 #define TDX_IDENT		"IntelTDX    "
 
+#ifndef __ASSEMBLY__
+
+/*
+ * Used in __tdx_hypercall() to pass down and get back registers' values of
+ * the TDCALL instruction when requesting services from the VMM.
+ *
+ * This is a software only structure and not part of the TDX module/VMM ABI.
+ */
+struct tdx_hypercall_args {
+	u64 r10;
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
+
+/* Used to request services from the VMM */
+u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
+
+#endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 76cffbda0e79..10f39bec7c7d 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -3,16 +3,10 @@
 #ifndef _ASM_X86_TDX_H
 #define _ASM_X86_TDX_H
 
-#include <linux/bits.h>
 #include <linux/init.h>
 #include <asm/ptrace.h>
 #include <asm/shared/tdx.h>
 
-#define TDX_HYPERCALL_STANDARD  0
-
-#define TDX_HCALL_HAS_OUTPUT	BIT(0)
-#define TDX_HCALL_ISSUE_STI	BIT(1)
-
 /*
  * SW-defined error codes.
  *
@@ -40,21 +34,6 @@ struct tdx_module_output {
 	u64 r11;
 };
 
-/*
- * Used in __tdx_hypercall() to pass down and get back registers' values of
- * the TDCALL instruction when requesting services from the VMM.
- *
- * This is a software only structure and not part of the TDX module/VMM ABI.
- */
-struct tdx_hypercall_args {
-	u64 r10;
-	u64 r11;
-	u64 r12;
-	u64 r13;
-	u64 r14;
-	u64 r15;
-};
-
 /*
  * Used by the #VE exception handler to gather the #VE exception
  * info from the TDX module. This is a software only structure
@@ -79,9 +58,6 @@ void __init tdx_early_init(void);
 u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 		      struct tdx_module_output *out);
 
-/* Used to request services from the VMM */
-u64 __tdx_hypercall(struct tdx_hypercall_args *args, unsigned long flags);
-
 void tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
-- 
2.34.1



* [PATCHv5 17/30] x86/tdx: Port I/O: add runtime hypercalls
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (15 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 16/30] x86/boot: Port I/O: add decompression-time support for TDX Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-08 21:30   ` Dave Hansen
  2022-03-02 14:27 ` [PATCHv5 18/30] x86/tdx: Port I/O: add early boot support Kirill A. Shutemov
                   ` (12 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A . Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

TDX hypervisors cannot emulate instructions directly. This includes
port I/O, which is normally emulated in the hypervisor. All port I/O
instructions inside TDX trigger the #VE exception in the guest and
would normally be emulated there.

Use a hypercall to emulate port I/O. Extend
tdx_handle_virt_exception() and add support for handling #VE due to
port I/O instructions.

String I/O operations are not supported in TDX. Unroll them by declaring
the CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute.
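
For reference, a worked decode of the exit qualification, using the
VE_* macros below (layout per the VMX "Exit Qualification for I/O
Instructions" table): a #VE for a one-byte IN from port 0x3f8 arrives
with exit_qual == 0x03f80008, so:

  int  size = (0x03f80008 & GENMASK(2, 0)) + 1;	/* 0 + 1 => 1 byte       */
  bool in   = 0x03f80008 & BIT(3);		/* set   => IN           */
  bool str  = 0x03f80008 & BIT(4);		/* clear => not INS/OUTS */
  u16  port = 0x03f80008 >> 16;			/* 0x3f8                 */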

== Userspace Implications ==

The ioperm() facility allows userspace access to I/O instructions like
inb/outb.  Among other things, this allows writing userspace device
drivers.

This series has no special handling for ioperm(). Users will be able to
successfully request I/O permissions but will induce a #VE on their
first I/O instruction. If this is undesirable, users can enable the
kernel lockdown feature with the 'lockdown=integrity' kernel command-line
option, which makes ioperm() fail.

More robust handling of this situation (denying ioperm() in all TDX
guests) will be addressed in follow-on work.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/core.c |  7 +++-
 arch/x86/coco/tdx.c  | 79 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 9113baebbfd2..5615b75e6fc6 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -18,7 +18,12 @@ static u64 cc_mask __ro_after_init;
 
 static bool intel_cc_platform_has(enum cc_attr attr)
 {
-	return false;
+	switch (attr) {
+	case CC_ATTR_GUEST_UNROLL_STRING_IO:
+		return true;
+	default:
+		return false;
+	}
 }
 
 /*
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index e6163e7e3247..1f58375f61df 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -19,6 +19,16 @@
 #define EPT_READ	0
 #define EPT_WRITE	1
 
+/* Port I/O direction */
+#define PORT_READ	0
+#define PORT_WRITE	1
+
+/* See Exit Qualification for I/O Instructions in VMX documentation */
+#define VE_IS_IO_IN(e)		((e) & BIT(3))
+#define VE_GET_IO_SIZE(e)	(((e) & GENMASK(2, 0)) + 1)
+#define VE_GET_PORT_NUM(e)	((e) >> 16)
+#define VE_IS_IO_STRING(e)	((e) & BIT(4))
+
 /*
  * Wrapper for standard use of __tdx_hypercall with no output aside from
  * return code.
@@ -312,6 +322,73 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 	return true;
 }
 
+static bool handle_in(struct pt_regs *regs, int size, int port)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
+		.r12 = size,
+		.r13 = PORT_READ,
+		.r14 = port,
+	};
+	bool success;
+	u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
+
+	/*
+	 * Emulate the I/O read via hypercall. More info about ABI can be found
+	 * in TDX Guest-Host-Communication Interface (GHCI) section titled
+	 * "TDG.VP.VMCALL<Instruction.IO>".
+	 */
+	success = !__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT);
+
+	/* Update part of the register affected by the emulated instruction */
+	regs->ax &= ~mask;
+	if (success)
+		regs->ax |= args.r11 & mask;
+
+	return success;
+}
+
+static bool handle_out(struct pt_regs *regs, int size, int port)
+{
+	u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
+
+	/*
+	 * Emulate the I/O write via hypercall. More info about ABI can be found
+	 * in TDX Guest-Host-Communication Interface (GHCI) section titled
+	 * "TDG.VP.VMCALL<Instruction.IO>".
+	 */
+	return !_tdx_hypercall(hcall_func(EXIT_REASON_IO_INSTRUCTION), size,
+			       PORT_WRITE, port, regs->ax & mask);
+}
+
+/*
+ * Emulate I/O using hypercall.
+ *
+ * Assumes the IO instruction was using ax, which is enforced
+ * by the standard io.h macros.
+ *
+ * Return True on success or False on failure.
+ */
+static bool handle_io(struct pt_regs *regs, u32 exit_qual)
+{
+	bool in;
+	int size, port;
+
+	if (VE_IS_IO_STRING(exit_qual))
+		return false;
+
+	in   = VE_IS_IO_IN(exit_qual);
+	size = VE_GET_IO_SIZE(exit_qual);
+	port = VE_GET_PORT_NUM(exit_qual);
+
+
+	if (in)
+		return handle_in(regs, size, port);
+	else
+		return handle_out(regs, size, port);
+}
+
 void tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -361,6 +438,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 		return handle_cpuid(regs);
 	case EXIT_REASON_EPT_VIOLATION:
 		return handle_mmio(regs, ve);
+	case EXIT_REASON_IO_INSTRUCTION:
+		return handle_io(regs, ve->exit_qual);
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return false;
-- 
2.34.1



* [PATCHv5 18/30] x86/tdx: Port I/O: add early boot support
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (16 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 17/30] x86/tdx: Port I/O: add runtime hypercalls Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 19/30] x86/tdx: Wire up KVM hypercalls Kirill A. Shutemov
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A . Shutemov,
	Dave Hansen

From: Andi Kleen <ak@linux.intel.com>

TDX guests cannot do port I/O directly. The TDX module triggers a #VE
exception to let the guest kernel emulate port I/O by converting the
instructions into TDCALLs to the host.

But before IDT handlers are set up, port I/O cannot be emulated using
normal kernel #VE handlers. To support #VE-based emulation during
this boot window, add minimal early #VE handler support to the early
exception handlers. This is similar to what AMD SEV does. This is
mainly to support earlyprintk's serial driver, as well as potentially
the VGA driver.

The early handler only supports I/O-related #VE exceptions. Unhandled or
failed exceptions will be handled via early_fixup_exception() (like
normal exception failures). At runtime, I/O-related #VE exceptions (along
with other types) are handled by virt_exception_kernel().
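
For context, the main consumer this enables is earlyprintk's serial
output, which boils down to port I/O like the following before the IDT
is ready (sketch mirroring the boot serial code); each inb()/outb()
below raises a #VE that the new early handler resolves:

  while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
  	cpu_relax();
  outb(ch, early_serial_base + TXR);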

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/coco/tdx.c        | 16 ++++++++++++++++
 arch/x86/include/asm/tdx.h |  4 ++++
 arch/x86/kernel/head64.c   |  3 +++
 3 files changed, 23 insertions(+)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 1f58375f61df..391a05c7b1da 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -389,6 +389,22 @@ static bool handle_io(struct pt_regs *regs, u32 exit_qual)
 		return handle_out(regs, size, port);
 }
 
+/*
+ * Early #VE exception handler. Only handles a subset of port I/O.
+ * Intended only for earlyprintk. If failed, return false.
+ */
+__init bool tdx_early_handle_ve(struct pt_regs *regs)
+{
+	struct ve_info ve;
+
+	tdx_get_ve_info(&ve);
+
+	if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION)
+		return false;
+
+	return handle_io(regs, ve.exit_qual);
+}
+
 void tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 10f39bec7c7d..c20062698198 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -64,11 +64,15 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
 
 void tdx_safe_halt(void);
 
+bool tdx_early_handle_ve(struct pt_regs *regs);
+
 #else
 
 static inline void tdx_early_init(void) { };
 static inline void tdx_safe_halt(void) { };
 
+static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 6dff50c3edd6..ecbf50e5b8e0 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -417,6 +417,9 @@ void __init do_early_exception(struct pt_regs *regs, int trapnr)
 	    trapnr == X86_TRAP_VC && handle_vc_boot_ghcb(regs))
 		return;
 
+	if (trapnr == X86_TRAP_VE && tdx_early_handle_ve(regs))
+		return;
+
 	early_fixup_exception(regs, trapnr);
 }
 
-- 
2.34.1



* [PATCHv5 19/30] x86/tdx: Wire up KVM hypercalls
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (17 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 18/30] x86/tdx: Port I/O: add early boot support Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 20/30] x86/boot: Add a trampoline for booting APs via firmware handoff Kirill A. Shutemov
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A . Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

KVM hypercalls use the VMCALL or VMMCALL instructions. Although the ABI
is similar, those instructions no longer function for TDX guests.

Use vendor-specific TDVMCALLs instead of VMCALL. This enables TDX
guests to run with KVM acting as the hypervisor.

Among other things, the KVM hypercall is used to send IPIs.

Since the KVM driver can be built as a kernel module, export
tdx_kvm_hypercall() to make the symbol visible to kvm.ko.
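
For reference, the register mapping this relies on (sketch): a
kvm_hypercall2(nr, p1, p2) that previously issued VMCALL with
RAX/RBX/RCX now becomes:

  struct tdx_hypercall_args args = {
  	.r10 = nr,	/* hypercall number; != TDX_HYPERCALL_STANDARD,
  			 * so this is a vendor-specific TDVMCALL */
  	.r11 = p1,	/* was RBX */
  	.r12 = p2,	/* was RCX */
  };

  return __tdx_hypercall(&args, 0);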

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/coco/tdx.c             | 17 +++++++++++++++++
 arch/x86/include/asm/kvm_para.h | 22 ++++++++++++++++++++++
 arch/x86/include/asm/tdx.h      | 11 +++++++++++
 3 files changed, 50 insertions(+)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 391a05c7b1da..c82e8eda8c8b 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -58,6 +58,23 @@ static u64 hcall_func(u64 exit_reason)
 	return exit_reason;
 }
 
+#ifdef CONFIG_KVM_GUEST
+long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
+		       unsigned long p3, unsigned long p4)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = nr,
+		.r11 = p1,
+		.r12 = p2,
+		.r13 = p3,
+		.r14 = p4,
+	};
+
+	return __tdx_hypercall(&args, 0);
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);
+#endif
+
 static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 				   struct tdx_module_output *out)
 {
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 56935ebb1dfe..57bc74e112f2 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -7,6 +7,8 @@
 #include <linux/interrupt.h>
 #include <uapi/asm/kvm_para.h>
 
+#include <asm/tdx.h>
+
 #ifdef CONFIG_KVM_GUEST
 bool kvm_check_and_clear_guest_paused(void);
 #else
@@ -32,6 +34,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 static inline long kvm_hypercall0(unsigned int nr)
 {
 	long ret;
+
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr)
@@ -42,6 +48,10 @@ static inline long kvm_hypercall0(unsigned int nr)
 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
 {
 	long ret;
+
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return tdx_kvm_hypercall(nr, p1, 0, 0, 0);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1)
@@ -53,6 +63,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
 				  unsigned long p2)
 {
 	long ret;
+
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return tdx_kvm_hypercall(nr, p1, p2, 0, 0);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2)
@@ -64,6 +78,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
 				  unsigned long p2, unsigned long p3)
 {
 	long ret;
+
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return tdx_kvm_hypercall(nr, p1, p2, p3, 0);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -76,6 +94,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 				  unsigned long p4)
 {
 	long ret;
+
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index c20062698198..db8bf9a86b97 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -75,5 +75,16 @@ static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST)
+long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
+		       unsigned long p3, unsigned long p4);
+#else
+static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
+				     unsigned long p2, unsigned long p3,
+				     unsigned long p4)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
-- 
2.34.1



* [PATCHv5 20/30] x86/boot: Add a trampoline for booting APs via firmware handoff
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (18 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 19/30] x86/tdx: Wire up KVM hypercalls Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 21/30] x86/acpi, x86/boot: Add multiprocessor wake-up support Kirill A. Shutemov
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Sean Christopherson,
	Kirill A . Shutemov

From: Sean Christopherson <sean.j.christopherson@intel.com>

Historically, x86 platforms have booted secondary processors (APs)
using INIT followed by the startup IPI (SIPI) messages. In regular
VMs, this boot sequence is supported by the VMM emulation. But such a
wakeup model is fatal for secure VMs like TDX in which VMM is an
untrusted entity. To address this issue, a new wakeup model was added
in ACPI v6.4, in which firmware (like TDX virtual BIOS) will help boot
the APs. More details about this wakeup model can be found in ACPI
specification v6.4, the section titled "Multiprocessor Wakeup Structure".

Since the existing trampoline code requires processors to boot in real
mode with 16-bit addressing, it will not work for this wakeup model
(the new model hands the AP off in 64-bit mode). To handle this, extend
the trampoline code to support a 64-bit firmware handoff. Also, extend
the IDT and GDT pointers to support the 64-bit handoff.

There is no TDX-specific detection for this new boot method. The kernel
will rely on it as the sole boot method whenever the new ACPI structure
is present.

The ACPI table parser for the MADT multiprocessor wake up structure and
the wakeup method that uses this structure will be added by the following
patch in this series.
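
For reference, the resulting AP boot dispatch in do_boot_cpu() amounts
to the following (simplified sketch of the code below):

  if (apic->wakeup_secondary_cpu_64) {
  	/* 64-bit handoff: use the new trampoline entry point */
  	start_ip = real_mode_header->trampoline_start64;
  	boot_error = apic->wakeup_secondary_cpu_64(apicid, start_ip);
  } else if (apic->wakeup_secondary_cpu) {
  	boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
  }
  /* otherwise: the traditional INIT/SIPI (or NMI for the BSP) path */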

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/apic.h              |  2 ++
 arch/x86/include/asm/realmode.h          |  1 +
 arch/x86/kernel/smpboot.c                | 12 ++++++--
 arch/x86/realmode/rm/header.S            |  1 +
 arch/x86/realmode/rm/trampoline_64.S     | 38 ++++++++++++++++++++++++
 arch/x86/realmode/rm/trampoline_common.S | 12 +++++++-
 6 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 48067af94678..35006e151774 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -328,6 +328,8 @@ struct apic {
 
 	/* wakeup_secondary_cpu */
 	int	(*wakeup_secondary_cpu)(int apicid, unsigned long start_eip);
+	/* wakeup secondary CPU using 64-bit wakeup point */
+	int	(*wakeup_secondary_cpu_64)(int apicid, unsigned long start_eip);
 
 	void	(*inquire_remote_apic)(int apicid);
 
diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 331474b150f1..fd6f6e5b755a 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
 	u32	sev_es_trampoline_start;
 #endif
 #ifdef CONFIG_X86_64
+	u32	trampoline_start64;
 	u32	trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 617012f4619f..6269dd126dba 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1088,6 +1088,11 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
 	unsigned long boot_error = 0;
 	unsigned long timeout;
 
+#ifdef CONFIG_X86_64
+	/* If 64-bit wakeup method exists, use the 64-bit mode trampoline IP */
+	if (apic->wakeup_secondary_cpu_64)
+		start_ip = real_mode_header->trampoline_start64;
+#endif
 	idle->thread.sp = (unsigned long)task_pt_regs(idle);
 	early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
 	initial_code = (unsigned long)start_secondary;
@@ -1129,11 +1134,14 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
 
 	/*
 	 * Wake up a CPU in difference cases:
-	 * - Use the method in the APIC driver if it's defined
+	 * - Use a method from the APIC driver if one defined, with wakeup
+	 *   straight to 64-bit mode preferred over wakeup to RM.
 	 * Otherwise,
 	 * - Use an INIT boot APIC message for APs or NMI for BSP.
 	 */
-	if (apic->wakeup_secondary_cpu)
+	if (apic->wakeup_secondary_cpu_64)
+		boot_error = apic->wakeup_secondary_cpu_64(apicid, start_ip);
+	else if (apic->wakeup_secondary_cpu)
 		boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
 	else
 		boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
 	.long	pa_sev_es_trampoline_start
 #endif
 #ifdef CONFIG_X86_64
+	.long	pa_trampoline_start64
 	.long	pa_trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index cc8391f86cdb..ae112a91592f 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
 	ljmpl	$__KERNEL_CS, $pa_startup_64
 SYM_CODE_END(startup_32)
 
+SYM_CODE_START(pa_trampoline_compat)
+	/*
+	 * In compatibility mode.  Prep ESP and DX for startup_32, then disable
+	 * paging and complete the switch to legacy 32-bit mode.
+	 */
+	movl	$rm_stack_end, %esp
+	movw	$__KERNEL_DS, %dx
+
+	movl	$X86_CR0_PE, %eax
+	movl	%eax, %cr0
+	ljmpl   $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
 	.section ".text64","ax"
 	.code64
 	.balign 4
@@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
 	jmpq	*tr_start(%rip)
 SYM_CODE_END(startup_64)
 
+SYM_CODE_START(trampoline_start64)
+	/*
+	 * APs start here on a direct transfer from 64-bit BIOS with identity
+	 * mapped page tables.  Load the kernel's GDT in order to gear down to
+	 * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+	 * segment registers.  Load the zero IDT so any fault triggers a
+	 * shutdown instead of jumping back into BIOS.
+	 */
+	lidt	tr_idt(%rip)
+	lgdt	tr_gdt64(%rip)
+
+	ljmpl	*tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
 	.section ".rodata","a"
 	# Duplicate the global descriptor table
 	# so the kernel can live anywhere
@@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
 	.quad	0x00cf93000000ffff	# __KERNEL_DS
 SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
 
+SYM_DATA_START(tr_gdt64)
+	.short	tr_gdt_end - tr_gdt - 1	# gdt limit
+	.long	pa_tr_gdt
+	.long	0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+	.long	pa_trampoline_compat
+	.short	__KERNEL32_CS
+SYM_DATA_END(tr_compat)
+
 	.bss
 	.balign	PAGE_SIZE
 SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 5033e640f957..4331c32c47f8 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -1,4 +1,14 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 	.section ".rodata","a"
 	.balign	16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+
+/*
+ * When a bootloader hands off to the kernel in 32-bit mode an
+ * IDT with a 2-byte limit and 4-byte base is needed. When a boot
+ * loader hands off to a kernel 64-bit mode the base address
+ * extends to 8-bytes. Reserve enough space for either scenario.
+ */
+SYM_DATA_START_LOCAL(tr_idt)
+	.short  0
+	.quad   0
+SYM_DATA_END(tr_idt)
-- 
2.34.1



* [PATCHv5 21/30] x86/acpi, x86/boot: Add multiprocessor wake-up support
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (19 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 20/30] x86/boot: Add a trampoline for booting APs via firmware handoff Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-02 14:27 ` [PATCHv5 22/30] x86/boot: Set CR0.NE early and keep it set during the boot Kirill A. Shutemov
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Sean Christopherson,
	Rafael J . Wysocki, Kirill A . Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

TDX cannot use the INIT/SIPI protocol to bring up secondary CPUs because
it requires assistance from the untrusted VMM.

For platforms that do not support INIT/SIPI, ACPI defines a wakeup
model (using a mailbox) via the MADT multiprocessor wakeup structure.
More details about it can be found in the ACPI specification v6.4, in
the section titled "Multiprocessor Wakeup Structure". If the platform
firmware produces the multiprocessor wakeup structure, the OS may use
this new mailbox-based mechanism to wake up the APs.

Add ACPI MADT wake structure parsing support for the x86 platform and,
if the MADT wake table is present, update apic->wakeup_secondary_cpu_64
with a new API that uses the MADT wake mailbox to wake up CPUs.
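
For reference, the mailbox layout acpi_wakeup_cpu() operates on (the
ACPICA definition of the ACPI 6.4 Multiprocessor Wakeup Structure
mailbox, reproduced here for context):

  struct acpi_madt_multiproc_wakeup_mailbox {
  	u16 command;			/* ACPI_MP_WAKE_COMMAND_WAKEUP == 1 */
  	u16 reserved;			/* must be zero */
  	u32 apic_id;
  	u64 wakeup_vector;
  	u8  reserved_os[2032];		/* reserved for OS use */
  	u8  reserved_firmware[2048];	/* reserved for firmware use */
  };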

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/apic.h |   5 ++
 arch/x86/kernel/acpi/boot.c | 118 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/apic/apic.c |  10 +++
 3 files changed, 133 insertions(+)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 35006e151774..bd8ae0a7010a 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -490,6 +490,11 @@ static inline unsigned int read_apic_id(void)
 	return apic->get_apic_id(reg);
 }
 
+#ifdef CONFIG_X86_64
+typedef int (*wakeup_cpu_handler)(int apicid, unsigned long start_eip);
+extern void acpi_wake_cpu_handler_update(wakeup_cpu_handler handler);
+#endif
+
 extern int default_apic_id_valid(u32 apicid);
 extern int default_acpi_madt_oem_check(char *, char *);
 extern void default_setup_apic_routing(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 5b6d1a95776f..99518eac2bbc 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -65,6 +65,15 @@ static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
 static bool acpi_support_online_capable;
 #endif
 
+#ifdef CONFIG_X86_64
+/* Physical address of the Multiprocessor Wakeup Structure mailbox */
+static u64 acpi_mp_wake_mailbox_paddr;
+/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+/* Lock to protect mailbox (acpi_mp_wake_mailbox) from parallel access */
+static DEFINE_SPINLOCK(mailbox_lock);
+#endif
+
 #ifdef CONFIG_X86_IO_APIC
 /*
  * Locks related to IOAPIC hotplug
@@ -336,6 +345,84 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
 	return 0;
 }
 
+#ifdef CONFIG_X86_64
+/* Wake up a secondary CPU via the ACPI MADT mailbox wakeup mechanism */
+static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
+{
+	static physid_mask_t apic_id_wakemap = PHYSID_MASK_NONE;
+	u8 timeout;
+
+	/* Remap mailbox memory only for the first call to acpi_wakeup_cpu() */
+	if (physids_empty(apic_id_wakemap)) {
+		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+						sizeof(*acpi_mp_wake_mailbox),
+						MEMREMAP_WB);
+	}
+
+	/*
+	 * According to the ACPI specification r6.4, section titled
+	 * "Multiprocessor Wakeup Structure" the mailbox-based wakeup
+	 * mechanism cannot be used more than once for the same CPU.
+	 * Skip wakeups if they are attempted more than once.
+	 */
+	if (physid_isset(apicid, apic_id_wakemap)) {
+		pr_err("CPU already awake (APIC ID %x), skipping wakeup\n",
+		       apicid);
+		return -EINVAL;
+	}
+
+	spin_lock(&mailbox_lock);
+
+	/*
+	 * Mailbox memory is shared between firmware and OS. Firmware will
+	 * listen on mailbox command address, and once it receives the wakeup
+	 * command, CPU associated with the given apicid will be booted.
+	 *
+	 * The value of apic_id and wakeup_vector has to be set before updating
+	 * the wakeup command. To enforce this ordering, use
+	 * smp_store_release().
+	 */
+	smp_store_release(&acpi_mp_wake_mailbox->apic_id, apicid);
+	smp_store_release(&acpi_mp_wake_mailbox->wakeup_vector, start_ip);
+	smp_store_release(&acpi_mp_wake_mailbox->command,
+			  ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+	/*
+	 * After writing the wakeup command, wait for maximum timeout of 0xFF
+	 * for firmware to reset the command address back to zero to indicate
+	 * the successful reception of command.
+	 * NOTE: 0xFF as timeout value is decided based on our experiments.
+	 *
+	 * XXX: Change the timeout once ACPI specification comes up with
+	 *      standard maximum timeout value.
+	 */
+	timeout = 0xFF;
+	while (READ_ONCE(acpi_mp_wake_mailbox->command) && --timeout)
+		cpu_relax();
+
+	/* If timed out (timeout == 0), return error */
+	if (!timeout) {
+		/*
+		 * XXX: Is there a recovery path after timeout is hit?
+		 * Spec is unclear. Reset command to 0 if timeout is hit.
+		 */
+		acpi_mp_wake_mailbox->command = 0;
+		spin_unlock(&mailbox_lock);
+		return -EIO;
+	}
+
+	/*
+	 * If the CPU wakeup process is successful, store the
+	 * status in apic_id_wakemap to prevent re-wakeup
+	 * requests.
+	 */
+	physid_set(apicid, apic_id_wakemap);
+
+	spin_unlock(&mailbox_lock);
+
+	return 0;
+}
+#endif
 #endif				/*CONFIG_X86_LOCAL_APIC */
 
 #ifdef CONFIG_X86_IO_APIC
@@ -1083,6 +1170,29 @@ static int __init acpi_parse_madt_lapic_entries(void)
 	}
 	return 0;
 }
+
+#ifdef CONFIG_X86_64
+static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+				     const unsigned long end)
+{
+	struct acpi_madt_multiproc_wakeup *mp_wake;
+
+	if (!IS_ENABLED(CONFIG_SMP))
+		return -ENODEV;
+
+	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
+	if (BAD_MADT_ENTRY(mp_wake, end))
+		return -EINVAL;
+
+	acpi_table_print_madt_entry(&header->common);
+
+	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+
+	acpi_wake_cpu_handler_update(acpi_wakeup_cpu);
+
+	return 0;
+}
+#endif				/* CONFIG_X86_64 */
 #endif				/* CONFIG_X86_LOCAL_APIC */
 
 #ifdef	CONFIG_X86_IO_APIC
@@ -1278,6 +1388,14 @@ static void __init acpi_process_madt(void)
 
 				smp_found_config = 1;
 			}
+
+#ifdef CONFIG_X86_64
+			/*
+			 * Parse MADT MP Wake entry.
+			 */
+			acpi_table_parse_madt(ACPI_MADT_TYPE_MULTIPROC_WAKEUP,
+					      acpi_parse_mp_wake, 1);
+#endif
 		}
 		if (error == -EINVAL) {
 			/*
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b70344bf6600..3c8f2c797a98 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2551,6 +2551,16 @@ u32 x86_msi_msg_get_destid(struct msi_msg *msg, bool extid)
 }
 EXPORT_SYMBOL_GPL(x86_msi_msg_get_destid);
 
+#ifdef CONFIG_X86_64
+void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler)
+{
+	struct apic **drv;
+
+	for (drv = __apicdrivers; drv < __apicdrivers_end; drv++)
+		(*drv)->wakeup_secondary_cpu_64 = handler;
+}
+#endif
+
 /*
  * Override the generic EOI implementation with an optimized version.
  * Only called during early boot when only one CPU is active and with
-- 
2.34.1



* [PATCHv5 22/30] x86/boot: Set CR0.NE early and keep it set during the boot
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (20 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 21/30] x86/acpi, x86/boot: Add multiprocessor wake-up support Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-08 21:37   ` Dave Hansen
  2022-03-02 14:27 ` [PATCHv5 23/30] x86/boot: Avoid #VE during boot for TDX platforms Kirill A. Shutemov
                   ` (7 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov

A TDX guest requires CR0.NE to be set. Clearing the bit triggers #GP(0).

If CR0.NE is 0, the MS-DOS compatibility mode for handling floating-point
exceptions is selected. In this mode, the software exception handler for
floating-point exceptions is invoked externally using the processor’s
FERR#, INTR, and IGNNE# pins.

Using FERR# and IGNNE# to handle floating-point exceptions is deprecated.
CR0.NE=0 also limits newer processors to operating with one logical
processor active.

The kernel uses the CR0_STATE constant to initialize CR0. It has the NE
bit set. But during early boot the kernel takes a more ad-hoc approach
to setting bits in the register.

Make CR0 initialization consistent, deriving the initial value of CR0
from CR0_STATE.
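
For reference, CR0_STATE is defined in <asm/processor-flags.h> as:

  #define CR0_STATE	(X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
  			 X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
  			 X86_CR0_PG)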

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S   | 7 ++++---
 arch/x86/realmode/rm/trampoline_64.S | 8 ++++----
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index fd9441f40457..d0c3d33f3542 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -289,7 +289,7 @@ SYM_FUNC_START(startup_32)
 	pushl	%eax
 
 	/* Enter paged protected Mode, activating Long Mode */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax /* Enable Paging and Protected mode */
+	movl	$CR0_STATE, %eax
 	movl	%eax, %cr0
 
 	/* Jump from 32bit compatibility mode into 64bit mode. */
@@ -662,8 +662,9 @@ SYM_CODE_START(trampoline_32bit_src)
 	pushl	$__KERNEL_CS
 	pushl	%eax
 
-	/* Enable paging again */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
+	/* Enable paging again. */
+	movl	%cr0, %eax
+	btsl	$X86_CR0_PG_BIT, %eax
 	movl	%eax, %cr0
 
 	lret
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index ae112a91592f..d380f2d1fd23 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -70,7 +70,7 @@ SYM_CODE_START(trampoline_start)
 	movw	$__KERNEL_DS, %dx	# Data segment descriptor
 
 	# Enable protected mode
-	movl	$X86_CR0_PE, %eax	# protected mode (PE) bit
+	movl	$(CR0_STATE & ~X86_CR0_PG), %eax
 	movl	%eax, %cr0		# into protected mode
 
 	# flush prefetch and jump to startup_32
@@ -148,8 +148,8 @@ SYM_CODE_START(startup_32)
 	movl	$MSR_EFER, %ecx
 	wrmsr
 
-	# Enable paging and in turn activate Long Mode
-	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+	# Enable paging and in turn activate Long Mode.
+	movl	$CR0_STATE, %eax
 	movl	%eax, %cr0
 
 	/*
@@ -169,7 +169,7 @@ SYM_CODE_START(pa_trampoline_compat)
 	movl	$rm_stack_end, %esp
 	movw	$__KERNEL_DS, %dx
 
-	movl	$X86_CR0_PE, %eax
+	movl	$(CR0_STATE & ~X86_CR0_PG), %eax
 	movl	%eax, %cr0
 	ljmpl   $__KERNEL32_CS, $pa_startup_32
 SYM_CODE_END(pa_trampoline_compat)
-- 
2.34.1



* [PATCHv5 23/30] x86/boot: Avoid #VE during boot for TDX platforms
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (21 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 22/30] x86/boot: Set CR0.NE early and keep it set during the boot Kirill A. Shutemov
@ 2022-03-02 14:27 ` Kirill A. Shutemov
  2022-03-07  9:29   ` Xiaoyao Li
  2022-03-02 14:28 ` [PATCHv5 24/30] x86/topology: Disable CPU online/offline control for TDX guests Kirill A. Shutemov
                   ` (6 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A . Shutemov

From: Sean Christopherson <seanjc@google.com>

There are a few MSRs and control register bits that the kernel
normally needs to modify during boot. But TDX disallows
modification of these registers to help provide consistent security
guarantees. Fortunately, TDX ensures that these are all in the correct
state before the kernel loads, which means the kernel does not need to
modify them.

The conditions to avoid are:

 * Any writes to the EFER MSR
 * Clearing CR4.MCE

This theoretically makes the guest boot more fragile. If, for instance,
EFER was set up incorrectly and a WRMSR was performed, it will trigger
an early exception panic, or a triple fault if it happens before early
exceptions are set up. However, such a misconfiguration is likely to
trip up the guest BIOS long before control reaches the kernel. In any
case, these kinds of problems are unlikely to occur in production
environments, and developers have good debug tools to fix them quickly.

Change the common boot code to work on TDX and non-TDX systems.
This should have no functional effect on non-TDX systems.
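
In C terms, the EFER handling added below amounts to this (sketch;
'nx_supported' stands in for the CPUID NX check done in assembly):

  u64 efer, new;

  rdmsrl(MSR_EFER, efer);
  new = efer | EFER_SCE;		/* enable SYSCALL */
  if (nx_supported)
  	new |= EFER_NX;
  if (new != efer)			/* skip the WRMSR if nothing changed */
  	wrmsrl(MSR_EFER, new);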

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                     |  1 +
 arch/x86/boot/compressed/head_64.S   | 20 ++++++++++++++++++--
 arch/x86/boot/compressed/pgtable.h   |  2 +-
 arch/x86/kernel/head_64.S            | 28 ++++++++++++++++++++++++++--
 arch/x86/realmode/rm/trampoline_64.S | 13 ++++++++++++-
 5 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d2f45e58e846..98efb35ed7b1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,6 +886,7 @@ config INTEL_TDX_GUEST
 	depends on X86_X2APIC
 	select ARCH_HAS_CC_PLATFORM
 	select DYNAMIC_PHYSICAL_MASK
+	select X86_MCE
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d0c3d33f3542..6d903b2fc544 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -643,12 +643,28 @@ SYM_CODE_START(trampoline_32bit_src)
 	movl	$MSR_EFER, %ecx
 	rdmsr
 	btsl	$_EFER_LME, %eax
+	/* Avoid writing EFER if no change was made (for TDX guest) */
+	jc	1f
 	wrmsr
-	popl	%edx
+1:	popl	%edx
 	popl	%ecx
 
+#ifdef CONFIG_X86_MCE
+	/*
+	 * Preserve CR4.MCE if the kernel will enable #MC support.
+	 * Clearing MCE may fault in some environments (that also force #MC
+	 * support). Any machine check that occurs before #MC support is fully
+	 * configured will crash the system regardless of the CR4.MCE value set
+	 * here.
+	 */
+	movl	%cr4, %eax
+	andl	$X86_CR4_MCE, %eax
+#else
+	movl	$0, %eax
+#endif
+
 	/* Enable PAE and LA57 (if required) paging modes */
-	movl	$X86_CR4_PAE, %eax
+	orl	$X86_CR4_PAE, %eax
 	testl	%edx, %edx
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE	0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
 
 #define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9c63fc5988cd..184b7468ea76 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -140,8 +140,22 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	addq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
+#ifdef CONFIG_X86_MCE
+	/*
+	 * Preserve CR4.MCE if the kernel will enable #MC support.
+	 * Clearing MCE may fault in some environments (that also force #MC
+	 * support). Any machine check that occurs before #MC support is fully
+	 * configured will crash the system regardless of the CR4.MCE value set
+	 * here.
+	 */
+	movq	%cr4, %rcx
+	andl	$X86_CR4_MCE, %ecx
+#else
+	movl	$0, %ecx
+#endif
+
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
@@ -246,13 +260,23 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	/* Setup EFER (Extended Feature Enable Register) */
 	movl	$MSR_EFER, %ecx
 	rdmsr
+	/*
+	 * Preserve current value of EFER for comparison and to skip
+	 * EFER writes if no change was made (for TDX guest)
+	 */
+	movl    %eax, %edx
 	btsl	$_EFER_SCE, %eax	/* Enable System Call */
 	btl	$20,%edi		/* No Execute supported? */
 	jnc     1f
 	btsl	$_EFER_NX, %eax
 	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1:	wrmsr				/* Make changes effective */
 
+	/* Avoid writing EFER if no change was made (for TDX guest) */
+1:	cmpl	%edx, %eax
+	je	1f
+	xor	%edx, %edx
+	wrmsr				/* Make changes effective */
+1:
 	/* Setup cr0 */
 	movl	$CR0_STATE, %eax
 	/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index d380f2d1fd23..e38d61d6562e 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,11 +143,22 @@ SYM_CODE_START(startup_32)
 	movl	%eax, %cr3
 
 	# Set up EFER
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	/*
+	 * Skip writing to EFER if the register already has desired
+	 * value (to avoid #VE for the TDX guest).
+	 */
+	cmp	pa_tr_efer, %eax
+	jne	.Lwrite_efer
+	cmp	pa_tr_efer + 4, %edx
+	je	.Ldone_efer
+.Lwrite_efer:
 	movl	pa_tr_efer, %eax
 	movl	pa_tr_efer + 4, %edx
-	movl	$MSR_EFER, %ecx
 	wrmsr
 
+.Ldone_efer:
 	# Enable paging and in turn activate Long Mode.
 	movl	$CR0_STATE, %eax
 	movl	%eax, %cr0
-- 
2.34.1



* [PATCHv5 24/30] x86/topology: Disable CPU online/offline control for TDX guests
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (22 preceding siblings ...)
  2022-03-02 14:27 ` [PATCHv5 23/30] x86/boot: Avoid #VE during boot for TDX platforms Kirill A. Shutemov
@ 2022-03-02 14:28 ` Kirill A. Shutemov
  2022-03-02 14:28 ` [PATCHv5 25/30] x86/tdx: Make pages shared in ioremap() Kirill A. Shutemov
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A . Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Unlike regular VMs, TDX guests use the firmware hand-off wakeup method
to wake up the APs during the boot process. This wakeup model uses a
mailbox to communicate with the firmware to bring up the APs. As per the
design, this mailbox can only be used once for a given AP, which means
that after the APs are booted, the same mailbox cannot be used to
offline/online that AP. More details about this requirement can be
found in the Intel TDX Virtual Firmware Design Guide, in the sections
titled "AP initialization in OS" and "Hotplug Device".

Since the architecture does not support any method of offlining the
CPUs, disable CPU hotplug support in the kernel.

Since this hotplug-disable feature can be reused by other VM guests,
add a new CC attribute, CC_ATTR_HOTPLUG_DISABLED, and use it to disable
hotplug support.

With hotplug disabled, attempts to offline a CPU via the
/sys/devices/system/cpu/cpuX/online sysfs interface will fail for TDX
guests.
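
A sketch of how generic code consumes the new attribute (this is the
pattern cpu_down_maps_locked() uses below):

  #include <linux/cc_platform.h>

  if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
  	return -EOPNOTSUPP;	/* platform forbids CPU offlining */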

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/core.c        |  1 +
 include/linux/cc_platform.h | 10 ++++++++++
 kernel/cpu.c                |  7 +++++++
 3 files changed, 18 insertions(+)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 5615b75e6fc6..54344122e2fe 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -20,6 +20,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
 {
 	switch (attr) {
 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
+	case CC_ATTR_HOTPLUG_DISABLED:
 		return true;
 	default:
 		return false;
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index efd8205282da..691494bbaf5a 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -72,6 +72,16 @@ enum cc_attr {
 	 * Examples include TDX guest & SEV.
 	 */
 	CC_ATTR_GUEST_UNROLL_STRING_IO,
+
+	/**
+	 * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
+	 *
+	 * The platform/OS is running as a guest/virtual machine that does
+	 * not support the CPU hotplug feature.
+	 *
+	 * Examples include TDX Guest.
+	 */
+	CC_ATTR_HOTPLUG_DISABLED,
 };
 
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f39eb0b52dfe..c94f00fa34d3 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -34,6 +34,7 @@
 #include <linux/scs.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/cpuset.h>
+#include <linux/cc_platform.h>
 
 #include <trace/events/power.h>
 #define CREATE_TRACE_POINTS
@@ -1185,6 +1186,12 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
 
 static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 {
+	/*
+	 * If the platform does not support hotplug, report it explicitly to
+	 * differentiate it from a transient offlining failure.
+	 */
+	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
+		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
 	return _cpu_down(cpu, 0, target);
-- 
2.34.1



* [PATCHv5 25/30] x86/tdx: Make pages shared in ioremap()
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (23 preceding siblings ...)
  2022-03-02 14:28 ` [PATCHv5 24/30] x86/topology: Disable CPU online/offline control for TDX guests Kirill A. Shutemov
@ 2022-03-02 14:28 ` Kirill A. Shutemov
  2022-03-08 22:02   ` Dave Hansen
  2022-03-02 14:28 ` [PATCHv5 26/30] x86/mm/cpa: Add support for TDX shared memory Kirill A. Shutemov
                   ` (4 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov

In TDX guests, guest memory is protected from host access. If a guest
performs I/O, it needs to explicitly share the I/O memory with the host.

Map all ioremap()ed pages that are not backed by normal memory
(IORES_DESC_NONE or IORES_DESC_RESERVED) as shared.

Since TDX memory encryption support is similar to the AMD SEV
architecture, reuse the infrastructure from the AMD SEV code.
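
Drivers need no change for this. A hedged illustration (the PCI device
and BAR are hypothetical); the shared attribute is applied automatically
by __ioremap_caller() below:

	/* MMIO BAR gets mapped as shared (decrypted) in a TDX guest. */
	void __iomem *regs = ioremap(pci_resource_start(pdev, 0),
				     pci_resource_len(pdev, 0));
	if (!regs)
		return -ENOMEM;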

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/ioremap.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 026031b3b782..a5d4ec1afca2 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -242,10 +242,15 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
 	 * If the page being mapped is in memory and SEV is active then
 	 * make sure the memory encryption attribute is enabled in the
 	 * resulting mapping.
+	 * In TDX guests, memory is marked private by default. If encryption
+	 * is not requested (via the 'encrypted' argument), explicitly set
+	 * the decrypted attribute on all ioremap()ed memory.
 	 */
 	prot = PAGE_KERNEL_IO;
 	if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
 		prot = pgprot_encrypted(prot);
+	else
+		prot = pgprot_decrypted(prot);
 
 	switch (pcm) {
 	case _PAGE_CACHE_MODE_UC:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCHv5 26/30] x86/mm/cpa: Add support for TDX shared memory
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (24 preceding siblings ...)
  2022-03-02 14:28 ` [PATCHv5 25/30] x86/tdx: Make pages shared in ioremap() Kirill A. Shutemov
@ 2022-03-02 14:28 ` Kirill A. Shutemov
  2022-03-09 19:44   ` Dave Hansen
  2022-03-02 14:28 ` [PATCHv5 27/30] x86/kvm: Use bounce buffers for TD guest Kirill A. Shutemov
                   ` (3 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov

Intel TDX protects guest memory from VMM access. Any memory that is
required for communication with the VMM must be explicitly shared.

Sharing is a two-step process: the guest sets the shared bit in the page
table entry and notifies the VMM about the change. The notification
happens using the MapGPA hypercall.

Conversion back to private memory requires clearing the shared bit,
notifying the VMM with the MapGPA hypercall, and then accepting the
memory with the AcceptPage TDX module call.

Provide a TDX version of the x86_platform.guest.* callbacks. This makes
__set_memory_enc_pgtable() work correctly in TDX guests.
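
As a sketch, the flow for converting a single 4K page looks like this
(simplified from the code below; error handling and the 1G/2M accept
fast paths are omitted):

	/* Notify the VMM; for private->shared, 'gpa' has the shared bit set. */
	if (_tdx_hypercall(TDVMCALL_MAP_GPA, gpa, PAGE_SIZE, 0, 0))
		return false;

	/* Only a shared->private conversion needs the accept step. */
	if (enc && __tdx_module_call(TDX_ACCEPT_PAGE, gpa, 0, 0, 0, NULL))
		return false;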

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/core.c    |   1 +
 arch/x86/coco/tdx.c     | 101 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps.c |   2 +-
 3 files changed, 103 insertions(+), 1 deletion(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 54344122e2fe..9778cf4c6901 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -21,6 +21,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
 	switch (attr) {
 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
 	case CC_ATTR_HOTPLUG_DISABLED:
+	case CC_ATTR_GUEST_MEM_ENCRYPT:
 		return true;
 	default:
 		return false;
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index c82e8eda8c8b..2168ee25a52c 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -10,10 +10,15 @@
 #include <asm/vmx.h>
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
+#include <asm/x86_init.h>
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_INFO			1
 #define TDX_GET_VEINFO			3
+#define TDX_ACCEPT_PAGE			6
+
+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA		0x10001
 
 /* MMIO direction */
 #define EPT_READ	0
@@ -495,6 +500,98 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
 	return ret;
 }
 
+static bool tdx_tlb_flush_required(bool enc)
+{
+	/*
+	 * TDX guest is responsible for flushing TLB on private->shared
+	 * transition. VMM is responsible for flushing on shared->private.
+	 */
+	return !enc;
+}
+
+static bool tdx_cache_flush_required(void)
+{
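+	/*
+	 * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
+	 * TDX doesn't have such capability.
+	 *
+	 * Flush cache unconditionally.
+	 */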
+	return true;
+}
+
+static bool accept_page(phys_addr_t gpa, enum pg_level pg_level)
+{
+	/*
+	 * Pass the page physical address to the TDX module to accept the
+	 * pending, private page.
+	 *
+	 * Bits 2:0 of GPA encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
+	 */
+	switch (pg_level) {
+	case PG_LEVEL_4K:
+		break;
+	case PG_LEVEL_2M:
+		gpa |= 1;
+		break;
+	case PG_LEVEL_1G:
+		gpa |= 2;
+		break;
+	default:
+		return false;
+	}
+
+	return !__tdx_module_call(TDX_ACCEPT_PAGE, gpa, 0, 0, 0, NULL);
+}
+
+/*
+ * Inform the VMM of the guest's intent for this physical page: shared with
+ * the VMM or private to the guest.  The VMM is expected to change its mapping
+ * of the page in response.
+ */
+static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+{
+	phys_addr_t start = __pa(vaddr);
+	phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+
+	if (!enc) {
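+		/* Set the shared (decrypted) bit in the GPAs passed to the VMM. */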
+		start |= cc_mkdec(0);
+		end |= cc_mkdec(0);
+	}
+
+	/*
+	 * Notify the VMM about page mapping conversion. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
+	 * section "TDG.VP.VMCALL<MapGPA>"
+	 */
+	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
+		return false;
+
+	/* A private->shared conversion requires only the MapGPA call */
+	if (!enc)
+		return true;
+
+	/*
+	 * For shared->private conversion, accept the page using
+	 * TDX_ACCEPT_PAGE TDX module call.
+	 */
+	while (start < end) {
+		/* Try if 1G page accept is possible */
+		if (!(start & ~PUD_MASK) && end - start >= PUD_SIZE &&
+		    accept_page(start, PG_LEVEL_1G)) {
+			start += PUD_SIZE;
+			continue;
+		}
+
+		/* Try if 2M page accept is possible */
+		if (!(start & ~PMD_MASK) && end - start >= PMD_SIZE &&
+		    accept_page(start, PG_LEVEL_2M)) {
+			start += PMD_SIZE;
+			continue;
+		}
+
+		if (!accept_page(start, PG_LEVEL_4K))
+			return false;
+		start += PAGE_SIZE;
+	}
+
+	return true;
+}
+
 void __init tdx_early_init(void)
 {
 	unsigned int gpa_width;
@@ -526,5 +623,9 @@ void __init tdx_early_init(void)
 	 */
 	cc_set_mask(BIT_ULL(gpa_width - 1));
 
+	x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
+	x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
+	x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
+
 	pr_info("Guest detected\n");
 }
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 1c3cb952fa2a..080f21171b27 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1308,7 +1308,7 @@ static void ve_raise_fault(struct pt_regs *regs, long error_code)
  *
  * In the settings that Linux will run in, virtualization exceptions are
  * never generated on accesses to normal, TD-private memory that has been
- * accepted.
+ * accepted (by BIOS or with tdx_enc_status_changed()).
  *
  * Syscall entry code has a critical window where the kernel stack is not
  * yet set up. Any exception in this window leads to hard to debug issues
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCHv5 27/30] x86/kvm: Use bounce buffers for TD guest
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (25 preceding siblings ...)
  2022-03-02 14:28 ` [PATCHv5 26/30] x86/mm/cpa: Add support for TDX shared memory Kirill A. Shutemov
@ 2022-03-02 14:28 ` Kirill A. Shutemov
  2022-03-09 20:07   ` Dave Hansen
  2022-03-02 14:28 ` [PATCHv5 28/30] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kirill A. Shutemov
                   ` (2 subsequent siblings)
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov

Intel TDX doesn't allow the VMM to directly access guest private memory.
Any memory that is required for communication with the VMM must be
shared explicitly. The same rule applies for any DMA to and from the
TDX guest. All DMA pages have to be marked as shared pages. A generic way
to achieve this without any changes to device drivers is to use the
SWIOTLB framework.

Force SWIOTLB on TD guests and make the SWIOTLB buffer shared by
generalizing mem_encrypt_init() to cover TDX.
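
With SWIOTLB forced, an unmodified driver's streaming DMA is bounced
through the shared buffer transparently. A hedged illustration (the
device, buffer and length are hypothetical):

	dma_addr_t dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

	if (dma_mapping_error(dev, dma))
		return -ENOMEM;
	/* ... the device DMAs to/from the shared bounce buffer ... */
	dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);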

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                   | 2 +-
 arch/x86/coco/core.c               | 1 +
 arch/x86/coco/tdx.c                | 3 +++
 arch/x86/include/asm/mem_encrypt.h | 6 +++---
 arch/x86/mm/mem_encrypt.c          | 9 ++++++++-
 5 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 98efb35ed7b1..1312cefb927d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -885,7 +885,7 @@ config INTEL_TDX_GUEST
 	depends on X86_64 && CPU_SUP_INTEL
 	depends on X86_X2APIC
 	select ARCH_HAS_CC_PLATFORM
-	select DYNAMIC_PHYSICAL_MASK
+	select X86_MEM_ENCRYPT
 	select X86_MCE
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 9778cf4c6901..b10326f91d4f 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -22,6 +22,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
 	case CC_ATTR_HOTPLUG_DISABLED:
 	case CC_ATTR_GUEST_MEM_ENCRYPT:
+	case CC_ATTR_MEM_ENCRYPT:
 		return true;
 	default:
 		return false;
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index 2168ee25a52c..429a1ba42667 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -5,6 +5,7 @@
 #define pr_fmt(fmt)     "tdx: " fmt
 
 #include <linux/cpufeature.h>
+#include <linux/swiotlb.h>
 #include <asm/coco.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
@@ -627,5 +628,7 @@ void __init tdx_early_init(void)
 	x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
 	x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
 
+	swiotlb_force = SWIOTLB_FORCE;
+
 	pr_info("Guest detected\n");
 }
diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index e2c6f433ed10..88ceaf3648b3 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -49,9 +49,6 @@ void __init early_set_mem_enc_dec_hypercall(unsigned long vaddr, int npages,
 
 void __init mem_encrypt_free_decrypted_mem(void);
 
-/* Architecture __weak replacement functions */
-void __init mem_encrypt_init(void);
-
 void __init sev_es_init_vc_handling(void);
 
 #define __bss_decrypted __section(".bss..decrypted")
@@ -89,6 +86,9 @@ static inline void mem_encrypt_free_decrypted_mem(void) { }
 
 #endif	/* CONFIG_AMD_MEM_ENCRYPT */
 
+/* Architecture __weak replacement functions */
+void __init mem_encrypt_init(void);
+
 /*
  * The __sme_pa() and __sme_pa_nodebug() macros are meant for use when
  * writing to or comparing values from the cr3 register.  Having the
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 50d209939c66..10ee40b5204b 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -42,7 +42,14 @@ bool force_dma_unencrypted(struct device *dev)
 
 static void print_mem_encrypt_feature_info(void)
 {
-	pr_info("AMD Memory Encryption Features active:");
+	pr_info("Memory Encryption Features active:");
+
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+		pr_cont(" Intel TDX\n");
+		return;
+	}
+
+	pr_cont("AMD ");
 
 	/* Secure Memory Encryption */
 	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCHv5 28/30] x86/tdx: ioapic: Add shared bit for IOAPIC base address
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (26 preceding siblings ...)
  2022-03-02 14:28 ` [PATCHv5 27/30] x86/kvm: Use bounce buffers for TD guest Kirill A. Shutemov
@ 2022-03-02 14:28 ` Kirill A. Shutemov
  2022-03-09 20:39   ` Dave Hansen
  2022-03-02 14:28 ` [PATCHv5 29/30] ACPICA: Avoid cache flush inside virtual machines Kirill A. Shutemov
  2022-03-02 14:28 ` [PATCHv5 30/30] Documentation/x86: Document TDX kernel architecture Kirill A. Shutemov
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Isaku Yamahata,
	Kirill A . Shutemov

From: Isaku Yamahata <isaku.yamahata@intel.com>

The kernel interacts with each bare-metal IOAPIC with a special
MMIO page. When running under KVM, the guest's IOAPICs are
emulated by KVM.

When running as a TDX guest, the guest needs to mark each IOAPIC
mapping as "shared" with the host.  This ensures that TDX private
protections are not applied to the page, which allows the TDX host
emulation to work.

ioremap()-created mappings, such as those used by virtio, will be marked
as shared by default. However, the IOAPIC code does not use ioremap()
and instead uses the fixmap mechanism.

Introduce a special fixmap helper just for the IOAPIC code.  Ensure
that it marks IOAPIC pages as "shared".  This replaces
set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
allows custom 'prot' values.

AMD SEV gets IOAPIC pages shared because FIXMAP_PAGE_NOCACHE has the
_ENC bit clear. TDX has to set the shared bit to expose the page to the
host.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/apic/io_apic.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index c1bb384935b0..d775f58a3c3e 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -49,6 +49,7 @@
 #include <linux/slab.h>
 #include <linux/memblock.h>
 #include <linux/msi.h>
+#include <linux/cc_platform.h>
 
 #include <asm/irqdomain.h>
 #include <asm/io.h>
@@ -65,6 +66,7 @@
 #include <asm/irq_remapping.h>
 #include <asm/hw_irq.h>
 #include <asm/apic.h>
+#include <asm/pgtable.h>
 
 #define	for_each_ioapic(idx)		\
 	for ((idx) = 0; (idx) < nr_ioapics; (idx)++)
@@ -2677,6 +2679,15 @@ static struct resource * __init ioapic_setup_resources(void)
 	return res;
 }
 
+static void io_apic_set_fixmap_nocache(enum fixed_addresses idx,
+				       phys_addr_t phys)
+{
+	pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+
+	flags = pgprot_decrypted(flags);
+	__set_fixmap(idx, phys, flags);
+}
+
 void __init io_apic_init_mappings(void)
 {
 	unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
@@ -2709,7 +2720,7 @@ void __init io_apic_init_mappings(void)
 				      __func__, PAGE_SIZE, PAGE_SIZE);
 			ioapic_phys = __pa(ioapic_phys);
 		}
-		set_fixmap_nocache(idx, ioapic_phys);
+		io_apic_set_fixmap_nocache(idx, ioapic_phys);
 		apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
 			__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
 			ioapic_phys);
@@ -2838,7 +2849,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
 	ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
 	ioapics[idx].mp_config.apicaddr = address;
 
-	set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+	io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
 	if (bad_ioapic_register(idx)) {
 		clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
 		return -ENODEV;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCHv5 29/30] ACPICA: Avoid cache flush inside virtual machines
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (27 preceding siblings ...)
  2022-03-02 14:28 ` [PATCHv5 28/30] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kirill A. Shutemov
@ 2022-03-02 14:28 ` Kirill A. Shutemov
  2022-03-02 16:13   ` Dan Williams
  2022-03-09 20:56   ` Dave Hansen
  2022-03-02 14:28 ` [PATCHv5 30/30] Documentation/x86: Document TDX kernel architecture Kirill A. Shutemov
  29 siblings, 2 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A. Shutemov

WBINVD is not supported in TDX guests and triggers #VE. There's no
robust way to emulate it. The kernel has to avoid it.

ACPI_FLUSH_CPU_CACHE() flushes caches using WBINVD on entering sleep
states. It is required to prevent data loss.

While running inside a virtual machine, the kernel can bypass cache
flushing. Changing sleep state in a virtual machine doesn't affect the
host system sleep state and cannot lead to data loss.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/acenv.h | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/acenv.h b/arch/x86/include/asm/acenv.h
index 9aff97f0de7f..d937c55e717e 100644
--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -13,7 +13,19 @@
 
 /* Asm macros */
 
-#define ACPI_FLUSH_CPU_CACHE()	wbinvd()
+/*
+ * ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states.
+ * It is required to prevent data loss.
+ *
+ * While running inside a virtual machine, the kernel can bypass cache flushing.
+ * Changing sleep state in a virtual machine doesn't affect the host system
+ * sleep state and cannot lead to data loss.
+ */
+#define ACPI_FLUSH_CPU_CACHE()					\
+do {								\
+	if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))	\
+		wbinvd();					\
+} while (0)
 
 int __acpi_acquire_global_lock(unsigned int *lock);
 int __acpi_release_global_lock(unsigned int *lock);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCHv5 30/30] Documentation/x86: Document TDX kernel architecture
  2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (28 preceding siblings ...)
  2022-03-02 14:28 ` [PATCHv5 29/30] ACPICA: Avoid cache flush inside virtual machines Kirill A. Shutemov
@ 2022-03-02 14:28 ` Kirill A. Shutemov
  2022-03-09 21:49   ` Dave Hansen
  29 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-02 14:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Kirill A . Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Document the TDX guest architecture details like #VE support,
shared memory, etc.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/x86/index.rst |   1 +
 Documentation/x86/tdx.rst   | 214 ++++++++++++++++++++++++++++++++++++
 2 files changed, 215 insertions(+)
 create mode 100644 Documentation/x86/tdx.rst

diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index f498f1d36cd3..382e53ca850a 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -24,6 +24,7 @@ x86-specific Documentation
    intel-iommu
    intel_txt
    amd-memory-encryption
+   tdx
    pti
    mds
    microcode
diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
new file mode 100644
index 000000000000..8ca60256511b
--- /dev/null
+++ b/Documentation/x86/tdx.rst
@@ -0,0 +1,214 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+Intel Trust Domain Extensions (TDX)
+=====================================
+
+Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
+the host and physical attacks by isolating the guest register state and by
+encrypting the guest memory. In TDX, a special TDX module sits between the
+host and the guest, and runs in a special mode and manages the guest/host
+separation.
+
+Since the host cannot directly access guest registers or memory, much
+normal functionality of a hypervisor must be moved into the guest. This is
+implemented using a Virtualization Exception (#VE) that is handled by the
+guest kernel. Some #VEs are handled entirely inside the guest kernel, but
+some require the hypervisor to be involved.
+
+TDX includes new hypercall-like mechanisms for communicating from the
+guest to the hypervisor or the TDX module.
+
+New TDX Exceptions
+==================
+
+TDX guests behave differently from bare-metal and traditional VMX guests.
+In TDX guests, otherwise normal instructions or memory accesses can cause
+#VE or #GP exceptions.
+
+Instructions marked with an '*' conditionally cause exceptions.  The
+details for these instructions are discussed below.
+
+Instruction-based #VE
+---------------------
+
+- Port I/O (INS, OUTS, IN, OUT)
+- HLT
+- MONITOR, MWAIT
+- WBINVD, INVD
+- VMCALL
+- RDMSR*,WRMSR*
+- CPUID*
+
+Instruction-based #GP
+---------------------
+
+- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
+  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
+- ENCLS, ENCLU
+- GETSEC
+- RSM
+- ENQCMD
+- RDMSR*,WRMSR*
+
+RDMSR/WRMSR Behavior
+--------------------
+
+MSR access behavior falls into three categories:
+
+- #GP generated
+- #VE generated
+- "Just works"
+
+In general, the #GP MSRs should not be used in guests.  Their use likely
+indicates a bug in the guest.  The guest may try to handle the #GP with a
+hypercall but it is unlikely to succeed.
+
+The #VE MSRs can typically be handled by the hypervisor.  Guests
+can make a hypercall to the hypervisor to handle the #VE.
+
+The "just works" MSRs do not need any special guest handling.  They might
+be implemented by directly passing through the MSR to the hardware or by
+trapping and handling in the TDX module.  Other than possibly being slow,
+these MSRs appear to function just as they would on bare metal.
+
+CPUID Behavior
+--------------
+
+For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
+return values (in guest EAX/EBX/ECX/EDX) are configurable by the
+hypervisor. For such cases, the Intel TDX module architecture defines two
+virtualization types:
+
+- Bit fields for which the hypervisor configures the value seen by the
+  guest TD.
+
+- Bit fields for which the hypervisor configures the value such that the
+  guest TD either sees their native value or a value of 0.
+
+A #VE is generated for CPUID leaves and sub-leaves that the TDX module
+doesn't know how to handle. The guest kernel may ask the hypervisor for
+the value with a hypercall.
+
+#VE on Memory Accesses
+======================
+
+There are essentially two classes of TDX memory: private and shared.
+Private memory receives full TDX protections.  Its content is protected
+against access from the hypervisor.  Shared memory is expected to be
+shared between guest and hypervisor.
+
+A TD guest is in control of whether its memory accesses are treated as
+private or shared.  It selects the behavior with a bit in its page table
+entries.  This helps ensure that a guest does not place sensitive
+information in shared memory, exposing it to the untrusted hypervisor.
+
+#VE on Shared Memory
+--------------------
+
+Access to shared mappings can cause a #VE.  The hypervisor ultimately
+controls whether a shared memory access causes a #VE, so the guest must be
+careful to reference only shared pages where it can safely handle a #VE.  For
+instance, the guest should be careful not to access shared memory in the
+#VE handler before it reads the #VE info structure (TDG.VP.VEINFO.GET).
+
+Shared mapping content is entirely controlled by the hypervisor. Shared
+mappings must never be used for sensitive memory content like stacks or
+kernel text, only for I/O buffers and MMIO regions.  A good rule of thumb
+is that hypervisor-shared memory should be treated the same as memory
+mapped to userspace.  Both the hypervisor and userspace are completely
+untrusted.
+
+MMIO for virtual devices is implemented as shared memory.  The guest must
+be careful not to access device MMIO regions unless it is also prepared to
+handle a #VE.
+
+#VE on Private Pages
+--------------------
+
+Accesses to private mappings can also cause #VEs.  Since all kernel memory
+is also private memory, the kernel might theoretically need to handle a
+#VE on arbitrary kernel memory accesses.  This is not feasible, so TDX
+guests ensure that all guest memory has been "accepted" before memory is
+used by the kernel.
+
+A modest amount of memory (typically 512M) is pre-accepted by the firmware
+before the kernel runs to ensure that the kernel can start up without
+being subjected to #VE's.
+
+The hypervisor is permitted to unilaterally move accepted pages to a
+"blocked" state. However, if it does this, page access will not generate a
+#VE.  It will, instead, cause a "TD Exit" where the hypervisor is required
+to handle the exception.
+
+Linux #VE handler
+=================
+
+Just like page faults or #GP's, #VE exceptions can either be handled or be
+fatal.  Typically, unhandled userspace #VE's result in a SIGSEGV.
+Unhandled kernel #VE's result in an oops.
+
+Handling nested exceptions on x86 is typically nasty business.  A #VE
+could be interrupted by an NMI which triggers another #VE and hilarity
+ensues.  TDX #VE's have a novel solution to make it slightly less nasty.
+
+During #VE handling, the TDX module ensures that all interrupts (including
+NMIs) are blocked.  The block remains in place until the guest makes a
+TDG.VP.VEINFO.GET TDCALL.  This allows the guest to choose when interrupts
+or new #VE's can be delivered.
+
+However, the guest kernel must still be careful to avoid potential
+#VE-triggering actions (discussed above) while this block is in place.
+While the block is in place, #VE's are elevated to double faults (#DF)
+which are not recoverable.
+
+MMIO handling
+=============
+
+In non-TDX VMs, MMIO is usually implemented by giving a guest access to
+a mapping which will cause a VMEXIT on access, and then the hypervisor emulates
+the access.  That is not possible in TDX guests because VMEXIT will expose the
+register state to the host. TDX guests don't trust the host and can't have
+their state exposed to the host.
+
+In TDX, the MMIO regions typically trigger a #VE exception in the guest.
+The guest #VE handler then emulates the MMIO instruction inside the guest
+and converts it into a controlled TDCALL to the host, rather than exposing
+guest state to the host.
+
+MMIO addresses on x86 are just special physical addresses. They can
+theoretically be accessed with any instruction that accesses memory.
+However, the kernel instruction decoding method is limited. It is only
+designed to decode instructions like those generated by io.h macros.
+
+MMIO access via other means (like structure overlays) may result in an
+oops.
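+
+For example, an access generated by the io.h helpers is decodable (the
+MMIO base and register offset below are illustrative)::
+
+	/* Compiles to a simple MOV that the #VE handler can decode. */
+	u32 val = readl(mmio_base + 0x10);
+
+while a plain dereference of a pointer into MMIO space, for example via
+a structure overlay, may produce an instruction form that the decoder
+rejects.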
+
+Shared Memory Conversions
+=========================
+
+All TDX guest memory starts out as private at boot.  This memory can not
+be accessed by the hypervisor.  However some kernel users like device
+drivers might have a need to share data with the hypervisor.  To do this,
+memory must be converted between shared and private.  This can be
+accomplished using some existing memory encryption helpers:
+
+set_memory_decrypted() converts a range of pages to shared.
+set_memory_encrypted() converts memory back to private.
+
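+A minimal sketch (sharing a single whole page; the buffer is
+illustrative)::
+
+	/* Share one page with the hypervisor, then make it private again. */
+	unsigned long vaddr = __get_free_page(GFP_KERNEL);
+
+	if (vaddr && !set_memory_decrypted(vaddr, 1)) {
+		/* ... exchange data with the hypervisor ... */
+		set_memory_encrypted(vaddr, 1);
+	}
+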
+Device drivers are the primary user of shared memory, but there's no need
+to touch every driver. DMA buffers and ioremap()'ed mappings do the
+conversions automatically.
+
+TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
+converted to shared on boot.
+
+For coherent DMA allocations, the DMA buffer gets converted at
+allocation time. See force_dma_unencrypted() for details.
+
+References
+==========
+
+TDX reference material is collected here:
+
+https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 29/30] ACPICA: Avoid cache flush inside virtual machines
  2022-03-02 14:28 ` [PATCHv5 29/30] ACPICA: Avoid cache flush inside virtual machines Kirill A. Shutemov
@ 2022-03-02 16:13   ` Dan Williams
  2022-03-09 20:56   ` Dave Hansen
  1 sibling, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-03-02 16:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Kuppuswamy Sathyanarayanan,
	Andrea Arcangeli, Andi Kleen, David Hildenbrand, H. Peter Anvin,
	Juergen Gross, Jim Mattson, Joerg Roedel, Josh Poimboeuf,
	Kuppuswamy Sathyanarayanan, Paolo Bonzini, sdeep,
	Sean Christopherson, Luck, Tony, Vitaly Kuznetsov, Wanpeng Li,
	Tom Lendacky, Brijesh Singh, X86 ML, Linux Kernel Mailing List

On Wed, Mar 2, 2022 at 6:28 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> WBINVD is not supported in TDX guests and triggers #VE. There's no
> robust way to emulate it. The kernel has to avoid it.
>
> ACPI_FLUSH_CPU_CACHE() flushes caches using WBINVD on entering sleep
> states. It is required to prevent data loss.
>
> While running inside a virtual machine, the kernel can bypass cache
> flushing. Changing sleep state in a virtual machine doesn't affect the
> host system sleep state and cannot lead to data loss.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 15/30] x86/boot: Port I/O: allow to hook up alternative helpers
  2022-03-02 14:27 ` [PATCHv5 15/30] x86/boot: Port I/O: allow to hook up alternative helpers Kirill A. Shutemov
@ 2022-03-02 17:42   ` Josh Poimboeuf
  2022-03-02 19:41     ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Josh Poimboeuf @ 2022-03-02 17:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, knsathya, pbonzini, sdeep, seanjc,
	tony.luck, vkuznets, wanpengli, thomas.lendacky, brijesh.singh,
	x86, linux-kernel, Dave Hansen

On Wed, Mar 02, 2022 at 05:27:51PM +0300, Kirill A. Shutemov wrote:
> Port I/O instructions trigger #VE in the TDX environment. In response to
> the exception, kernel emulates these instructions using hypercalls.
> 
> But during early boot, on the decompression stage, it is cumbersome to
> deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
> handling.
> 
> Add a way to hook up alternative port I/O helpers in the boot stub with
> a new pio_ops structure.  For now, set the ops structure to just call
> the normal I/O operation functions.
> 
> The approach has downsides: TDX boot will fail if any code bypasses
> pio_ops and goes for a direct port I/O helper. The failure will only be
> visible on TDX boot (or other users of alternative pio_ops).
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Sorry, but this is still not convincing.

As you said earlier, it's a judgement call.  So, detail all the
considerations which were used when making that call.

Why is this the best approach compared to other alternatives?  It needs
to convince the reader.

Supporting #VE -- by building on the existing #VC support -- seems more
robust than this hack.  Convince me (and other patch reviewers)
otherwise.

At the very least, please remove the ability for future code to
accidentally bypass 'pio_ops'.  Going forward, are we really expected to
just remember to always use pio_ops for i/o?  Or else TDX will just
silently break?  That's just not acceptable.

-- 
Josh


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 15/30] x86/boot: Port I/O: allow to hook up alternative helpers
  2022-03-02 17:42   ` Josh Poimboeuf
@ 2022-03-02 19:41     ` Dave Hansen
  2022-03-02 20:02       ` Josh Poimboeuf
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2022-03-02 19:41 UTC (permalink / raw)
  To: Josh Poimboeuf, Kirill A. Shutemov
  Cc: tglx, mingo, bp, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, thomas.lendacky, brijesh.singh, x86, linux-kernel,
	Dave Hansen

On 3/2/22 09:42, Josh Poimboeuf wrote:
> At the very least, please remove the ability for future code to
> accidentally bypass 'pio_ops'.  Going forward, are we really expected to
> just remember to always use pio_ops for i/o?  Or else TDX will just
> silently break?  That's just not acceptable.

What did you have in mind here?  The in/out() instruction wrappers could
be moved to a spot where they're impossible to call directly, for instance.

I guess we could get really fancy and use objtool to look for any I/O
instructions that show up outside of the "official" pio_ops copies.
That would prevent anyone using inline assembly.

In the end, though, TDX *is* a new sub-architecture.  There are lots of
ways it's going to break silently and nobody will notice on bare metal.
SEV is the same way with things like the C (encryption) bit in the page
tables.  Adding more safeguards sounds like a good idea but, in the end,
we're going to have to find the non-obvious issues with testing.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 15/30] x86/boot: Port I/O: allow to hook up alternative helpers
  2022-03-02 19:41     ` Dave Hansen
@ 2022-03-02 20:02       ` Josh Poimboeuf
  0 siblings, 0 replies; 84+ messages in thread
From: Josh Poimboeuf @ 2022-03-02 20:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, knsathya, pbonzini, sdeep, seanjc,
	tony.luck, vkuznets, wanpengli, thomas.lendacky, brijesh.singh,
	x86, linux-kernel, Dave Hansen

On Wed, Mar 02, 2022 at 11:41:53AM -0800, Dave Hansen wrote:
> On 3/2/22 09:42, Josh Poimboeuf wrote:
> > At the very least, please remove the ability for future code to
> > accidentally bypass 'pio_ops'.  Going forward, are we really expected to
> > just remember to always use pio_ops for i/o?  Or else TDX will just
> > silently break?  That's just not acceptable.
>
> What did you have in mind here?  The in/out() instruction wrappers could
> be moved to a spot where they're impossible to call directly, for instance.

I guess, though why not just put the pio_ops crud in the inb/outb
wrappers themselves?

> I guess we could get really fancy and use objtool to look for any I/O
> instructions that show up outside of the "official" pio_ops copies.
> That would prevent anyone using inline assembly.

Yeah, there's no easy solution for asm and inline asm.  We would need
something like objtool to enforce the new "non-direct-i/o" policy in
boot code.  But objtool doesn't even validate boot code.

And it looks like this patch missed an "outb"?

static inline void io_delay(void)
{
	const u16 DELAY_PORT = 0x80;
	asm volatile("outb %%al,%0" : : "dN" (DELAY_PORT));
}

> In the end, though, TDX *is* a new sub-architecture.  There are lots of
> ways it's going to break silently and nobody will notice on bare metal.
>  SEV is the same way with things like the C (encryption) bit in the page
> tables.  Adding more safeguards sounds like a good idea but, in the end,
> we're going to have to find the non-obvious issues with testing.

Right, but for this case there's no reason to destabilize TDX on
purpose.

-- 
Josh


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 01/30] x86/tdx: Detect running as a TDX guest in early boot
  2022-03-02 14:27 ` [PATCHv5 01/30] x86/tdx: Detect running as a TDX guest in early boot Kirill A. Shutemov
@ 2022-03-04 15:43   ` Borislav Petkov
  2022-03-04 15:47     ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Borislav Petkov @ 2022-03-04 15:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Dave Hansen

On Wed, Mar 02, 2022 at 05:27:37PM +0300, Kirill A. Shutemov wrote:
> +void __init tdx_early_init(void)
> +{
> +	u32 eax, sig[3];
> +
> +	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
> +
> +	BUILD_BUG_ON(sizeof(sig) != sizeof(TDX_IDENT) - 1);

That's new.

Is that pure paranoia or what are you protecting here against?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 01/30] x86/tdx: Detect running as a TDX guest in early boot
  2022-03-04 15:43   ` Borislav Petkov
@ 2022-03-04 15:47     ` Dave Hansen
  2022-03-04 16:02       ` Borislav Petkov
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2022-03-04 15:47 UTC (permalink / raw)
  To: Borislav Petkov, Kirill A. Shutemov
  Cc: tglx, mingo, luto, peterz, sathyanarayanan.kuppuswamy, aarcange,
	ak, dan.j.williams, david, hpa, jgross, jmattson, joro, jpoimboe,
	knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, thomas.lendacky, brijesh.singh, x86, linux-kernel,
	Dave Hansen

On 3/4/22 07:43, Borislav Petkov wrote:
> On Wed, Mar 02, 2022 at 05:27:37PM +0300, Kirill A. Shutemov wrote:
>> +void __init tdx_early_init(void)
>> +{
>> +	u32 eax, sig[3];
>> +
>> +	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
>> +
>> +	BUILD_BUG_ON(sizeof(sig) != sizeof(TDX_IDENT) - 1);
> That's new.
> 
> Is that pure paranoia or what are you protecting here against?

Pure reviewer paranoia. :)

  https://lore.kernel.org/all/YhN5edJQ+LkVc0us@grain/


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 01/30] x86/tdx: Detect running as a TDX guest in early boot
  2022-03-04 15:47     ` Dave Hansen
@ 2022-03-04 16:02       ` Borislav Petkov
  2022-03-07 22:24         ` [PATCHv5.1 " Kirill A. Shutemov
  0 siblings, 1 reply; 84+ messages in thread
From: Borislav Petkov @ 2022-03-04 16:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, tglx, mingo, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Dave Hansen

On Fri, Mar 04, 2022 at 07:47:37AM -0800, Dave Hansen wrote:
> Pure reviewer paranoia. :)
> 
>   https://lore.kernel.org/all/YhN5edJQ+LkVc0us@grain/

This is one of those things where when you look at them months, years
from now, you'd go "WTF was that added for?". Because it clearly is
there to catch, well, something you'll catch anyway in testing. Because
if you fail detecting you're running as a TDX guest, you'll know pretty
early about it.

So if it is pure paranoia, you should drop it. Or if there's at least
some merit for it being there, then slap a comment above it why that
check is happening.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 23/30] x86/boot: Avoid #VE during boot for TDX platforms
  2022-03-02 14:27 ` [PATCHv5 23/30] x86/boot: Avoid #VE during boot for TDX platforms Kirill A. Shutemov
@ 2022-03-07  9:29   ` Xiaoyao Li
  2022-03-07 22:33     ` Kirill A. Shutemov
  2022-03-07 22:36     ` [PATCHv5.1 " Kirill A. Shutemov
  0 siblings, 2 replies; 84+ messages in thread
From: Xiaoyao Li @ 2022-03-07  9:29 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/2/2022 10:27 PM, Kirill A. Shutemov wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> There are a few MSRs and control register bits that the kernel
> normally needs to modify during boot. But, TDX disallows
> modification of these registers to help provide consistent security
> guarantees. Fortunately, TDX ensures that these are all in the correct
> state before the kernel loads, which means the kernel does not need to
> modify them.
> 
> The conditions to avoid are:
> 
>   * Any writes to the EFER MSR
>   * Clearing CR3.MCE

typo. CR4.MCE

BTW, I remember there was a patch to clear X86_FEATURE_MCE for TDX 
guest. Why does that get dropped?

Even though CPUID reports MCE is supported, all accesses to MCE-related
MSRs cause #VE. If they are accessed via mce_rdmsrl(), the #VE will be
fixed up and go to ex_handler_msr_mce(), finally leading to panic().




^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCHv5.1 01/30] x86/tdx: Detect running as a TDX guest in early boot
  2022-03-04 16:02       ` Borislav Petkov
@ 2022-03-07 22:24         ` Kirill A. Shutemov
  2022-03-09 18:22           ` Borislav Petkov
  0 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-07 22:24 UTC (permalink / raw)
  To: bp
  Cc: aarcange, ak, brijesh.singh, dan.j.williams, dave.hansen,
	dave.hansen, david, hpa, jgross, jmattson, joro, jpoimboe,
	kirill.shutemov, knsathya, linux-kernel, luto, mingo, pbonzini,
	peterz, sathyanarayanan.kuppuswamy, sdeep, seanjc, tglx,
	thomas.lendacky, tony.luck, vkuznets, wanpengli, x86

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

In preparation for extending the cc_platform_has() API to support TDX
guests, use the CPUID instruction to detect support for TDX guests in
the early boot code (via tdx_early_init()). Since copy_bootdata() is the
first user of the cc_platform_has() API, detect the TDX guest status
before it.

Define a synthetic feature flag (X86_FEATURE_TDX_GUEST) and set this
bit in a valid TDX guest platform.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 v5.1:
   - Drop BUILD_BUG_ON()
---
 arch/x86/Kconfig                         | 12 ++++++++++++
 arch/x86/coco/Makefile                   |  2 ++
 arch/x86/coco/tdx.c                      | 22 ++++++++++++++++++++++
 arch/x86/include/asm/cpufeatures.h       |  1 +
 arch/x86/include/asm/disabled-features.h |  8 +++++++-
 arch/x86/include/asm/tdx.h               | 21 +++++++++++++++++++++
 arch/x86/kernel/head64.c                 |  4 ++++
 7 files changed, 69 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/coco/tdx.c
 create mode 100644 arch/x86/include/asm/tdx.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 57a4e0285a80..c346d66b51fc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -880,6 +880,18 @@ config ACRN_GUEST
 	  IOT with small footprint and real-time features. More details can be
 	  found in https://projectacrn.org/.
 
+config INTEL_TDX_GUEST
+	bool "Intel TDX (Trust Domain Extensions) - Guest Support"
+	depends on X86_64 && CPU_SUP_INTEL
+	depends on X86_X2APIC
+	help
+	  Support running as a guest under Intel TDX.  Without this support,
+	  the guest kernel can not boot or run under TDX.
+	  TDX includes memory encryption and integrity capabilities
+	  which protect the confidentiality and integrity of guest
+	  memory contents and CPU state. TDX guests are protected from
+	  some attacks from the VMM.
+
 endif #HYPERVISOR_GUEST
 
 source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/coco/Makefile b/arch/x86/coco/Makefile
index c1ead00017a7..32f4c6e6f199 100644
--- a/arch/x86/coco/Makefile
+++ b/arch/x86/coco/Makefile
@@ -4,3 +4,5 @@ KASAN_SANITIZE_core.o	:= n
 CFLAGS_core.o		+= -fno-stack-protector
 
 obj-y += core.o
+
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
new file mode 100644
index 000000000000..97674471fd1e
--- /dev/null
+++ b/arch/x86/coco/tdx.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2021-2022 Intel Corporation */
+
+#undef pr_fmt
+#define pr_fmt(fmt)     "tdx: " fmt
+
+#include <linux/cpufeature.h>
+#include <asm/tdx.h>
+
+void __init tdx_early_init(void)
+{
+	u32 eax, sig[3];
+
+	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
+
+	if (memcmp(TDX_IDENT, sig, sizeof(sig)))
+		return;
+
+	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+	pr_info("Guest detected\n");
+}
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 5cd22090e53d..cacc8dde854b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -238,6 +238,7 @@
 #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
 #define X86_FEATURE_PVUNLOCK		( 8*32+20) /* "" PV unlock function */
 #define X86_FEATURE_VCPUPREEMPT		( 8*32+21) /* "" PV vcpu_is_preempted function */
+#define X86_FEATURE_TDX_GUEST		( 8*32+22) /* Intel Trust Domain Extensions Guest */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 1231d63f836d..b37de8268c9a 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -68,6 +68,12 @@
 # define DISABLE_SGX	(1 << (X86_FEATURE_SGX & 31))
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+# define DISABLE_TDX_GUEST	0
+#else
+# define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -79,7 +85,7 @@
 #define DISABLED_MASK5	0
 #define DISABLED_MASK6	0
 #define DISABLED_MASK7	(DISABLE_PTI)
-#define DISABLED_MASK8	0
+#define DISABLED_MASK8	(DISABLE_TDX_GUEST)
 #define DISABLED_MASK9	(DISABLE_SMAP|DISABLE_SGX)
 #define DISABLED_MASK10	0
 #define DISABLED_MASK11	0
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
new file mode 100644
index 000000000000..ba8042ce61c2
--- /dev/null
+++ b/arch/x86/include/asm/tdx.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2021-2022 Intel Corporation */
+#ifndef _ASM_X86_TDX_H
+#define _ASM_X86_TDX_H
+
+#include <linux/init.h>
+
+#define TDX_CPUID_LEAF_ID	0x21
+#define TDX_IDENT		"IntelTDX    "
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+void __init tdx_early_init(void);
+
+#else
+
+static inline void tdx_early_init(void) { };
+
+#endif /* CONFIG_INTEL_TDX_GUEST */
+
+#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 4f5ecbbaae77..6dff50c3edd6 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -40,6 +40,7 @@
 #include <asm/extable.h>
 #include <asm/trapnr.h>
 #include <asm/sev.h>
+#include <asm/tdx.h>
 
 /*
  * Manage page tables very early on.
@@ -514,6 +515,9 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	idt_setup_early_handler();
 
+	/* Needed before cc_platform_has() can be used for TDX */
+	tdx_early_init();
+
 	copy_bootdata(__va(real_mode_data));
 
 	/*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCHv5.1 12/30] x86/tdx: Detect TDX at early kernel decompression time
  2022-03-02 14:27 ` [PATCHv5 12/30] x86/tdx: Detect TDX at early kernel decompression time Kirill A. Shutemov
@ 2022-03-07 22:27   ` Kirill A. Shutemov
  0 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-07 22:27 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: aarcange, ak, bp, brijesh.singh, dan.j.williams, dave.hansen,
	dave.hansen, david, hpa, jgross, jmattson, joro, jpoimboe,
	knsathya, linux-kernel, luto, mingo, pbonzini, peterz,
	sathyanarayanan.kuppuswamy, sdeep, seanjc, tglx, thomas.lendacky,
	tony.luck, vkuznets, wanpengli, x86

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

The early decompression code does port I/O for its console output. But,
handling the decompression-time port I/O demands a different approach
from normal runtime because the IDT required to support #VE based port
I/O emulation is not yet set up. Paravirtualizing I/O calls during
the decompression step is acceptable because the decompression code
doesn't have many call sites that use I/O instructions.

To support port I/O in decompression code, TDX must be detected before
the decompression code might do port I/O. Detect whether the kernel runs
in a TDX guest.

Add an early_is_tdx_guest() interface to query the cached TDX guest
status in the decompression code.

TDX is detected with CPUID. Make cpuid_count() accessible outside
boot/cpuflags.c.

TDX detection in the main kernel is very similar. Move common bits
into <asm/shared/tdx.h>.

The actual port I/O paravirtualization will come later in the series.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 v5.1:
  - Drop BUILD_BUG_ON()
---
 arch/x86/boot/compressed/Makefile |  1 +
 arch/x86/boot/compressed/misc.c   |  8 ++++++++
 arch/x86/boot/compressed/misc.h   |  2 ++
 arch/x86/boot/compressed/tdx.c    | 26 ++++++++++++++++++++++++++
 arch/x86/boot/compressed/tdx.h    | 15 +++++++++++++++
 arch/x86/boot/cpuflags.c          |  3 +--
 arch/x86/boot/cpuflags.h          |  1 +
 arch/x86/include/asm/shared/tdx.h |  8 ++++++++
 arch/x86/include/asm/tdx.h        |  4 +---
 9 files changed, 63 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdx.c
 create mode 100644 arch/x86/boot/compressed/tdx.h
 create mode 100644 arch/x86/include/asm/shared/tdx.h

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6115274fe10f..732f6b21ecbd 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -101,6 +101,7 @@ ifdef CONFIG_X86_64
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index a4339cb2d247..2b1169869b96 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -370,6 +370,14 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 	lines = boot_params->screen_info.orig_video_lines;
 	cols = boot_params->screen_info.orig_video_cols;
 
+	/*
+	 * Detect TDX guest environment.
+	 *
+	 * It has to be done before console_init() in order to use
+	 * paravirtualized port I/O operations if needed.
+	 */
+	early_tdx_detect();
+
 	console_init();
 
 	/*
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 16ed360b6692..0d8e275a9d96 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -28,6 +28,8 @@
 #include <asm/bootparam.h>
 #include <asm/desc_defs.h>
 
+#include "tdx.h"
+
 #define BOOT_CTYPE_H
 #include <linux/acpi.h>
 
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
new file mode 100644
index 000000000000..d4f195e9d1ef
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.c
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include "../cpuflags.h"
+#include "../string.h"
+
+#include <asm/shared/tdx.h>
+
+static bool tdx_guest_detected;
+
+bool early_is_tdx_guest(void)
+{
+	return tdx_guest_detected;
+}
+
+void early_tdx_detect(void)
+{
+	u32 eax, sig[3];
+
+	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
+
+	if (memcmp(TDX_IDENT, sig, sizeof(sig)))
+		return;
+
+	/* Cache TDX guest feature status */
+	tdx_guest_detected = true;
+}
diff --git a/arch/x86/boot/compressed/tdx.h b/arch/x86/boot/compressed/tdx.h
new file mode 100644
index 000000000000..a7bff6ae002e
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_COMPRESSED_TDX_H
+#define BOOT_COMPRESSED_TDX_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+void early_tdx_detect(void);
+bool early_is_tdx_guest(void);
+#else
+static inline void early_tdx_detect(void) { };
+static inline bool early_is_tdx_guest(void) { return false; }
+#endif
+
+#endif /* BOOT_COMPRESSED_TDX_H */
diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
index a0b75f73dc63..a83d67ec627d 100644
--- a/arch/x86/boot/cpuflags.c
+++ b/arch/x86/boot/cpuflags.c
@@ -71,8 +71,7 @@ int has_eflag(unsigned long mask)
 # define EBX_REG "=b"
 #endif
 
-static inline void cpuid_count(u32 id, u32 count,
-		u32 *a, u32 *b, u32 *c, u32 *d)
+void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d)
 {
 	asm volatile(".ifnc %%ebx,%3 ; movl  %%ebx,%3 ; .endif	\n\t"
 		     "cpuid					\n\t"
diff --git a/arch/x86/boot/cpuflags.h b/arch/x86/boot/cpuflags.h
index 2e20814d3ce3..475b8fde90f7 100644
--- a/arch/x86/boot/cpuflags.h
+++ b/arch/x86/boot/cpuflags.h
@@ -17,5 +17,6 @@ extern u32 cpu_vendor[3];
 
 int has_eflag(unsigned long mask);
 void get_cpuflags(void);
+void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d);
 
 #endif
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
new file mode 100644
index 000000000000..8209ba9ffe1a
--- /dev/null
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SHARED_TDX_H
+#define _ASM_X86_SHARED_TDX_H
+
+#define TDX_CPUID_LEAF_ID	0x21
+#define TDX_IDENT		"IntelTDX    "
+
+#endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 1f150e7a2f8f..76cffbda0e79 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -6,9 +6,7 @@
 #include <linux/bits.h>
 #include <linux/init.h>
 #include <asm/ptrace.h>
-
-#define TDX_CPUID_LEAF_ID	0x21
-#define TDX_IDENT		"IntelTDX    "
+#include <asm/shared/tdx.h>
 
 #define TDX_HYPERCALL_STANDARD  0
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 23/30] x86/boot: Avoid #VE during boot for TDX platforms
  2022-03-07  9:29   ` Xiaoyao Li
@ 2022-03-07 22:33     ` Kirill A. Shutemov
  2022-03-08  1:19       ` Xiaoyao Li
  2022-03-07 22:36     ` [PATCHv5.1 " Kirill A. Shutemov
  1 sibling, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-07 22:33 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: tglx, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On Mon, Mar 07, 2022 at 05:29:27PM +0800, Xiaoyao Li wrote:
> On 3/2/2022 10:27 PM, Kirill A. Shutemov wrote:
> > From: Sean Christopherson <seanjc@google.com>
> > 
> > There are a few MSRs and control register bits that the kernel
> > normally needs to modify during boot. But, TDX disallows
> > modification of these registers to help provide consistent security
> > guarantees. Fortunately, TDX ensures that these are all in the correct
> > state before the kernel loads, which means the kernel does not need to
> > modify them.
> > 
> > The conditions to avoid are:
> > 
> >   * Any writes to the EFER MSR
> >   * Clearing CR3.MCE
> 
> typo. CR4.MCE

Thanks, will send updated patch.

> BTW, I remember there was a patch to clear X86_FEATURE_MCE for TDX guest.
> Why does that get dropped?

It is not dropped. It is just not part of the initial submission. It will
come later.
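
Roughly, it boils down to a one-liner in tdx_early_init() -- a sketch only,
the exact placement may differ:

	/* Hide MCE from CPUID so MCE code never pokes #VE-inducing MSRs */
	setup_clear_cpu_cap(X86_FEATURE_MCE);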

> Even though CPUID reports MCE is supported, all the access to MCE related
> MSRs causes #VE. If they are accessed via mce_rdmsrl(), the #VE will be
> fixed up and goes to ex_handler_msr_mce(). Finally lead to panic().

It is not a panic, but a warning. Like this:

	unchecked MSR access error: RDMSR from 0x179 at rIP: 0xffffffff810df1e9 (__mcheck_cpu_cap_init+0x9/0x130)
	Call Trace:
	 <TASK>
	 mcheck_cpu_init+0x3d/0x2c0
	 identify_cpu+0x85a/0x910
	 identify_boot_cpu+0xc/0x98
	 check_bugs+0x6/0xa7
	 start_kernel+0x363/0x3d1
	 secondary_startup_64_no_verify+0xe5/0xeb
	 </TASK>

It is annoying, but not fatal. The patchset is big enough as it is.
I tried to keep the patch count under control.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCHv5.1 23/30] x86/boot: Avoid #VE during boot for TDX platforms
  2022-03-07  9:29   ` Xiaoyao Li
  2022-03-07 22:33     ` Kirill A. Shutemov
@ 2022-03-07 22:36     ` Kirill A. Shutemov
  1 sibling, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-07 22:36 UTC (permalink / raw)
  To: xiaoyao.li
  Cc: aarcange, ak, bp, brijesh.singh, dan.j.williams, dave.hansen,
	david, hpa, jgross, jmattson, joro, jpoimboe, kirill.shutemov,
	knsathya, linux-kernel, luto, mingo, pbonzini, peterz,
	sathyanarayanan.kuppuswamy, sdeep, seanjc, tglx, thomas.lendacky,
	tony.luck, vkuznets, wanpengli, x86

From: Sean Christopherson <seanjc@google.com>

There are a few MSRs and control register bits that the kernel
normally needs to modify during boot. But, TDX disallows
modification of these registers to help provide consistent security
guarantees. Fortunately, TDX ensures that these are all in the correct
state before the kernel loads, which means the kernel does not need to
modify them.

The conditions to avoid are:

 * Any writes to the EFER MSR
 * Clearing CR4.MCE

This theoretically makes the guest boot more fragile. If, for instance,
EFER was set up incorrectly and a WRMSR was performed, it will trigger an
early exception panic, or a triple fault if it happens before early
exceptions are set up. However, this is likely to trip up the guest
BIOS long before control reaches the kernel. In any case, these kinds
of problems are unlikely to occur in production environments, and
developers have good debug tools to fix them quickly.

Change the common boot code to work on TDX and non-TDX systems.
This should have no functional effect on non-TDX systems.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 v5.1:
  - Fix typo in commit message: CR3.MCE -> CR4.MCE.
---
 arch/x86/Kconfig                     |  1 +
 arch/x86/boot/compressed/head_64.S   | 20 ++++++++++++++++++--
 arch/x86/boot/compressed/pgtable.h   |  2 +-
 arch/x86/kernel/head_64.S            | 28 ++++++++++++++++++++++++++--
 arch/x86/realmode/rm/trampoline_64.S | 13 ++++++++++++-
 5 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d2f45e58e846..98efb35ed7b1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,6 +886,7 @@ config INTEL_TDX_GUEST
 	depends on X86_X2APIC
 	select ARCH_HAS_CC_PLATFORM
 	select DYNAMIC_PHYSICAL_MASK
+	select X86_MCE
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d0c3d33f3542..6d903b2fc544 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -643,12 +643,28 @@ SYM_CODE_START(trampoline_32bit_src)
 	movl	$MSR_EFER, %ecx
 	rdmsr
 	btsl	$_EFER_LME, %eax
+	/* Avoid writing EFER if no change was made (for TDX guest) */
+	jc	1f
 	wrmsr
-	popl	%edx
+1:	popl	%edx
 	popl	%ecx
 
+#ifdef CONFIG_X86_MCE
+	/*
+	 * Preserve CR4.MCE if the kernel will enable #MC support.
+	 * Clearing MCE may fault in some environments (that also force #MC
+	 * support). Any machine check that occurs before #MC support is fully
+	 * configured will crash the system regardless of the CR4.MCE value set
+	 * here.
+	 */
+	movl	%cr4, %eax
+	andl	$X86_CR4_MCE, %eax
+#else
+	movl	$0, %eax
+#endif
+
 	/* Enable PAE and LA57 (if required) paging modes */
-	movl	$X86_CR4_PAE, %eax
+	orl	$X86_CR4_PAE, %eax
 	testl	%edx, %edx
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE	0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
 
 #define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9c63fc5988cd..184b7468ea76 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -140,8 +140,22 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	addq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
+#ifdef CONFIG_X86_MCE
+	/*
+	 * Preserve CR4.MCE if the kernel will enable #MC support.
+	 * Clearing MCE may fault in some environments (that also force #MC
+	 * support). Any machine check that occurs before #MC support is fully
+	 * configured will crash the system regardless of the CR4.MCE value set
+	 * here.
+	 */
+	movq	%cr4, %rcx
+	andl	$X86_CR4_MCE, %ecx
+#else
+	movl	$0, %ecx
+#endif
+
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
@@ -246,13 +260,23 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	/* Setup EFER (Extended Feature Enable Register) */
 	movl	$MSR_EFER, %ecx
 	rdmsr
+	/*
+	 * Preserve current value of EFER for comparison and to skip
+	 * EFER writes if no change was made (for TDX guest)
+	 */
+	movl    %eax, %edx
 	btsl	$_EFER_SCE, %eax	/* Enable System Call */
 	btl	$20,%edi		/* No Execute supported? */
 	jnc     1f
 	btsl	$_EFER_NX, %eax
 	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1:	wrmsr				/* Make changes effective */
 
+	/* Avoid writing EFER if no change was made (for TDX guest) */
+1:	cmpl	%edx, %eax
+	je	1f
+	xor	%edx, %edx
+	wrmsr				/* Make changes effective */
+1:
 	/* Setup cr0 */
 	movl	$CR0_STATE, %eax
 	/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index d380f2d1fd23..e38d61d6562e 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,11 +143,22 @@ SYM_CODE_START(startup_32)
 	movl	%eax, %cr3
 
 	# Set up EFER
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	/*
+	 * Skip writing to EFER if the register already has desired
+	 * value (to avoid #VE for the TDX guest).
+	 */
+	cmp	pa_tr_efer, %eax
+	jne	.Lwrite_efer
+	cmp	pa_tr_efer + 4, %edx
+	je	.Ldone_efer
+.Lwrite_efer:
 	movl	pa_tr_efer, %eax
 	movl	pa_tr_efer + 4, %edx
-	movl	$MSR_EFER, %ecx
 	wrmsr
 
+.Ldone_efer:
 	# Enable paging and in turn activate Long Mode.
 	movl	$CR0_STATE, %eax
 	movl	%eax, %cr0
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 23/30] x86/boot: Avoid #VE during boot for TDX platforms
  2022-03-07 22:33     ` Kirill A. Shutemov
@ 2022-03-08  1:19       ` Xiaoyao Li
  2022-03-08 16:41         ` Kirill A. Shutemov
  0 siblings, 1 reply; 84+ messages in thread
From: Xiaoyao Li @ 2022-03-08  1:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/8/2022 6:33 AM, Kirill A. Shutemov wrote:
> On Mon, Mar 07, 2022 at 05:29:27PM +0800, Xiaoyao Li wrote:
...
>> Even though CPUID reports MCE is supported, all the access to MCE related
>> MSRs causes #VE. If they are accessed via mce_rdmsrl(), the #VE will be
>> fixed up and goes to ex_handler_msr_mce(). Finally lead to panic().
> 
> It is not a panic, but a warning. Like this:
> 
> 	unchecked MSR access error: RDMSR from 0x179 at rIP: 0xffffffff810df1e9 (__mcheck_cpu_cap_init+0x9/0x130)
> 	Call Trace:
> 	 <TASK>
> 	 mcheck_cpu_init+0x3d/0x2c0
> 	 identify_cpu+0x85a/0x910
> 	 identify_boot_cpu+0xc/0x98
> 	 check_bugs+0x6/0xa7
> 	 start_kernel+0x363/0x3d1
> 	 secondary_startup_64_no_verify+0xe5/0xeb
> 	 </TASK>
> 
> It is annoying, but not fatal. The patchset is big enough as it is.
> I tried to keep the patch count under control.
> 

I did hit panic as below.

[    0.578792] mce: MSR access error: RDMSR from 0x475 at rIP: 
0xffffffffb94daa92 (mce_rdmsrl+0x22/0x60)
[    0.578792] Call Trace:
[    0.578792]  <TASK>
[    0.578792]  machine_check_poll+0xf0/0x260
[    0.578792]  __mcheck_cpu_init_generic+0x3d/0xb0
[    0.578792]  mcheck_cpu_init+0x16b/0x4a0
[    0.578792]  identify_cpu+0x467/0x5c0
[    0.578792]  identify_boot_cpu+0x10/0x9a
[    0.578792]  check_bugs+0x2a/0xa06
[    0.578792]  start_kernel+0x6bc/0x6f1
[    0.578792]  x86_64_start_reservations+0x24/0x26
[    0.578792]  x86_64_start_kernel+0xad/0xb2
[    0.578792]  secondary_startup_64_no_verify+0xe4/0xeb
[    0.578792]  </TASK>
[    0.578792] Kernel panic - not syncing: MCA architectural violation!
[    0.578792] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
5.17.0-rc5-td-guest-upstream+ #2
[    0.578792] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
0.0.0 02/06/2015
[    0.578792] Call Trace:
[    0.578792]  <TASK>
[    0.578792]  dump_stack_lvl+0x49/0x5f
[    0.578792]  dump_stack+0x10/0x12
[    0.578792]  panic+0xf9/0x2d0
[    0.578792]  ex_handler_msr_mce+0x5e/0x5e
[    0.578792]  fixup_exception+0x2f4/0x310
[    0.578792]  exc_virtualization_exception+0x9b/0x100
[    0.578792]  asm_exc_virtualization_exception+0x12/0x40
[    0.578792] RIP: 0010:mce_rdmsrl+0x22/0x60
[    0.578792] Code: a0 b9 e8 75 4d fb ff 90 55 48 89 e5 41 54 53 89 fb 
48 c7 c7 9c c1 f6 b9 e8 4b 28 00 00 65 8a 05 97 52 b4 46 84 c0 75 10 89 
d9 <0f> 32 48 c1 e2 20 48 09 d0 5b 41 5c 5d c3 89 df e8 c9 5a 17 ff 4c
[    0.578792] RSP: 0000:ffffffffba203cd8 EFLAGS: 00010246
[    0.578792] RAX: 0000000000000000 RBX: 0000000000000475 RCX: 
0000000000000475
[    0.578792] RDX: 00000000000001d0 RSI: ffffffffb9f6c19c RDI: 
ffffffffb9ece016
[    0.578792] RBP: ffffffffba203ce8 R08: ffffffffba203cb0 R09: 
ffffffffba203cb4
[    0.578792] R10: 0000000000000000 R11: 000000000000000f R12: 
0000000000000001
[    0.578792] R13: ffffffffba203dc0 R14: 000000000000000a R15: 
000000000000001d
[    0.578792]  ? mce_rdmsrl+0x15/0x60
[    0.578792]  machine_check_poll+0xf0/0x260
[    0.578792]  __mcheck_cpu_init_generic+0x3d/0xb0
[    0.578792]  mcheck_cpu_init+0x16b/0x4a0
[    0.578792]  identify_cpu+0x467/0x5c0
[    0.578792]  identify_boot_cpu+0x10/0x9a
[    0.578792]  check_bugs+0x2a/0xa06
[    0.578792]  start_kernel+0x6bc/0x6f1
[    0.578792]  x86_64_start_reservations+0x24/0x26
[    0.578792]  x86_64_start_kernel+0xad/0xb2
[    0.578792]  secondary_startup_64_no_verify+0xe4/0xeb
[    0.578792]  </TASK>
[    0.578792] ---[ end Kernel panic - not syncing: MCA architectural 
violation! ]---


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 23/30] x86/boot: Avoid #VE during boot for TDX platforms
  2022-03-08  1:19       ` Xiaoyao Li
@ 2022-03-08 16:41         ` Kirill A. Shutemov
  0 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-08 16:41 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: tglx, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On Tue, Mar 08, 2022 at 09:19:06AM +0800, Xiaoyao Li wrote:
> On 3/8/2022 6:33 AM, Kirill A. Shutemov wrote:
> > On Mon, Mar 07, 2022 at 05:29:27PM +0800, Xiaoyao Li wrote:
> ...
> > > Even though CPUID reports MCE is supported, all the access to MCE related
> > > MSRs causes #VE. If they are accessed via mce_rdmsrl(), the #VE will be
> > > fixed up and goes to ex_handler_msr_mce(). Finally lead to panic().
> > 
> > It is not a panic, but a warning. Like this:
> > 
> > 	unchecked MSR access error: RDMSR from 0x179 at rIP: 0xffffffff810df1e9 (__mcheck_cpu_cap_init+0x9/0x130)
> > 	Call Trace:
> > 	 <TASK>
> > 	 mcheck_cpu_init+0x3d/0x2c0
> > 	 identify_cpu+0x85a/0x910
> > 	 identify_boot_cpu+0xc/0x98
> > 	 check_bugs+0x6/0xa7
> > 	 start_kernel+0x363/0x3d1
> > 	 secondary_startup_64_no_verify+0xe5/0xeb
> > 	 </TASK>
> > 
> > It is annoying, but not fatal. The patchset is big enough as it is.
> > I tried to keep the patch count under control.
> > 
> 
> I did hit panic as below.
> 
> [    0.578792] mce: MSR access error: RDMSR from 0x475 at rIP:
> 0xffffffffb94daa92 (mce_rdmsrl+0x22/0x60)
> [    0.578792] Call Trace:
> [    0.578792]  <TASK>
> [    0.578792]  machine_check_poll+0xf0/0x260
> [    0.578792]  __mcheck_cpu_init_generic+0x3d/0xb0
> [    0.578792]  mcheck_cpu_init+0x16b/0x4a0
> [    0.578792]  identify_cpu+0x467/0x5c0
> [    0.578792]  identify_boot_cpu+0x10/0x9a
> [    0.578792]  check_bugs+0x2a/0xa06
> [    0.578792]  start_kernel+0x6bc/0x6f1
> [    0.578792]  x86_64_start_reservations+0x24/0x26
> [    0.578792]  x86_64_start_kernel+0xad/0xb2
> [    0.578792]  secondary_startup_64_no_verify+0xe4/0xeb
> [    0.578792]  </TASK>
> [    0.578792] Kernel panic - not syncing: MCA architectural violation!
> [    0.578792] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> 5.17.0-rc5-td-guest-upstream+ #2
> [    0.578792] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> 0.0.0 02/06/2015
> [    0.578792] Call Trace:
> [    0.578792]  <TASK>
> [    0.578792]  dump_stack_lvl+0x49/0x5f
> [    0.578792]  dump_stack+0x10/0x12
> [    0.578792]  panic+0xf9/0x2d0
> [    0.578792]  ex_handler_msr_mce+0x5e/0x5e
> [    0.578792]  fixup_exception+0x2f4/0x310
> [    0.578792]  exc_virtualization_exception+0x9b/0x100
> [    0.578792]  asm_exc_virtualization_exception+0x12/0x40
> [    0.578792] RIP: 0010:mce_rdmsrl+0x22/0x60
> [    0.578792] Code: a0 b9 e8 75 4d fb ff 90 55 48 89 e5 41 54 53 89 fb 48
> c7 c7 9c c1 f6 b9 e8 4b 28 00 00 65 8a 05 97 52 b4 46 84 c0 75 10 89 d9 <0f>
> 32 48 c1 e2 20 48 09 d0 5b 41 5c 5d c3 89 df e8 c9 5a 17 ff 4c
> [    0.578792] RSP: 0000:ffffffffba203cd8 EFLAGS: 00010246
> [    0.578792] RAX: 0000000000000000 RBX: 0000000000000475 RCX:
> 0000000000000475
> [    0.578792] RDX: 00000000000001d0 RSI: ffffffffb9f6c19c RDI:
> ffffffffb9ece016
> [    0.578792] RBP: ffffffffba203ce8 R08: ffffffffba203cb0 R09:
> ffffffffba203cb4
> [    0.578792] R10: 0000000000000000 R11: 000000000000000f R12:
> 0000000000000001
> [    0.578792] R13: ffffffffba203dc0 R14: 000000000000000a R15:
> 000000000000001d
> [    0.578792]  ? mce_rdmsrl+0x15/0x60
> [    0.578792]  machine_check_poll+0xf0/0x260
> [    0.578792]  __mcheck_cpu_init_generic+0x3d/0xb0
> [    0.578792]  mcheck_cpu_init+0x16b/0x4a0
> [    0.578792]  identify_cpu+0x467/0x5c0
> [    0.578792]  identify_boot_cpu+0x10/0x9a
> [    0.578792]  check_bugs+0x2a/0xa06
> [    0.578792]  start_kernel+0x6bc/0x6f1
> [    0.578792]  x86_64_start_reservations+0x24/0x26
> [    0.578792]  x86_64_start_kernel+0xad/0xb2
> [    0.578792]  secondary_startup_64_no_verify+0xe4/0xeb
> [    0.578792]  </TASK>
> [    0.578792] ---[ end Kernel panic - not syncing: MCA architectural

Hm. Does the MSR_IA32_MCG_CAP read succeed for you?

Otherwise you should not get inside the loop in machine_check_poll()
because mce_num_banks would be 0. In this case MSR 0x475 is never touched.
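
For reference, the flow in question is roughly this -- a simplified sketch
of machine_check_poll(), not the literal code:

	/* mce_num_banks is derived from MSR_IA32_MCG_CAP[7:0] */
	for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
		/* 0x475 is MSR_IA32_MCx_STATUS(29) */
		m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
		...
	}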

Anyway, the patchset is not intended to be the complete enabling of TDX.
There are a lot of corners to be smoothed before it is production-ready.
Let's keep it as it is.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 02/30] x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers
  2022-03-02 14:27 ` [PATCHv5 02/30] x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers Kirill A. Shutemov
@ 2022-03-08 19:56   ` Dave Hansen
  2022-03-10 12:32   ` Borislav Petkov
  1 sibling, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-08 19:56 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/2/22 06:27, Kirill A. Shutemov wrote:
> Secure Arbitration Mode (SEAM) is an extension of VMX architecture.  It
> defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
> operation (SEAM VMX non-root) which are both isolated from the legacy
> VMX operation where the host kernel runs.
> 
> A CPU-attested software module (called 'TDX module') runs in SEAM VMX
> root to manage and protect VMs running in SEAM VMX non-root.  SEAM VMX
> root is also used to host another CPU-attested software module (called
> 'P-SEAMLDR') to load and update the TDX module.
> 
> Host kernel transits to either P-SEAMLDR or TDX module via the new
> SEAMCALL instruction, which is essentially a VMExit from VMX root mode
> to SEAM VMX root mode.  SEAMCALLs are leaf functions defined by
> P-SEAMLDR and TDX module around the new SEAMCALL instruction.
> 
> A guest kernel can also communicate with TDX module via TDCALL
> instruction.
> 
> TDCALLs and SEAMCALLs use an ABI different from the x86-64 system-v ABI.
> RAX is used to carry both the SEAMCALL leaf function number (input) and
> the completion status (output).  Additional GPRs (RCX, RDX, R8-R11) may
> be further used as both input and output operands in individual leaf.
> 
> TDCALL and SEAMCALL share the same ABI and require the largely same
> code to pass down arguments and retrieve results.
> 
> Define an assembly macro that can be used to implement C wrapper for
> both TDCALL and SEAMCALL.

It's probably also worth mentioning that the SEAMCALL half won't get
used in this series.

> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index ba8042ce61c2..e5ff8ed59adf 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -8,6 +8,33 @@
>  #define TDX_CPUID_LEAF_ID	0x21
>  #define TDX_IDENT		"IntelTDX    "
>  
> +/*
> + * SW-defined error codes.
> + *
> + * Bits 47:40 == 0xFF indicate Reserved status code class that never used by
> + * TDX module.
That's a bit clunky.  Perhaps replace it with this:

 * Bits 47:40 == 0xFF indicate a "Reserved" status code class that is
   never used by the TDX module.

> + */
> +#define TDX_ERROR			(1UL << 63)
> +#define TDX_SW_ERROR			(TDX_ERROR | GENMASK_ULL(47, 40))
> +#define TDX_SEAMCALL_VMFAILINVALID	(TDX_SW_ERROR | 0xFFFF0000ULL)
> +
> +#ifndef __ASSEMBLY__

The "UL" construct doesn't work in the assembler.  But, this won't show up
until these constants get used in assembly.  If you use _BITUL(), it will
do the hard work for you.
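
For illustration -- a sketch, assuming nothing more than _BITUL() from
<linux/const.h>:

	#include <linux/const.h>

	/* Expands to (1 << 63) in assembly and to (1UL << 63) in C */
	#define TDX_ERROR	_BITUL(63)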

> +/*
> + * Used to gather the output registers values of the TDCALL and SEAMCALL
> + * instructions when requesting services from the TDX module.
> + *
> + * This is a software only structure and not part of the TDX module/VMM ABI.
> + */
> +struct tdx_module_output {
> +	u64 rcx;
> +	u64 rdx;
> +	u64 r8;
> +	u64 r9;
> +	u64 r10;
> +	u64 r11;
> +};

With those fixed:

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 03/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-03-02 14:27 ` [PATCHv5 03/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions Kirill A. Shutemov
@ 2022-03-08 20:03   ` Dave Hansen
  2022-03-10 15:30   ` Borislav Petkov
  1 sibling, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-08 20:03 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/2/22 06:27, Kirill A. Shutemov wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> Guests communicate with VMMs with hypercalls. Historically, these
> are implemented using instructions that are known to cause VMEXITs
> like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
> expose the guest state to the host. This prevents the old hypercall
> mechanisms from working. So, to communicate with VMM, TDX
> specification defines a new instruction called TDCALL.
> 
> In a TDX based VM, since the VMM is an untrusted entity, an intermediary
> layer -- TDX module -- facilitates secure communication between the host
> and the guest. TDX module is loaded like a firmware into a special CPU
> mode called SEAM. TDX guests communicate with the TDX module using the
> TDCALL instruction.
> 
> A guest uses TDCALL to communicate with both the TDX module and VMM.
> The value of the RAX register when executing the TDCALL instruction is
> used to determine the TDCALL type. A variant of TDCALL used to communicate
> with the VMM is called TDVMCALL.
> 
> Add generic interfaces to communicate with the TDX module and VMM
> (using the TDCALL instruction).
> 
> __tdx_hypercall()    - Used by the guest to request services from the
> 		       VMM (via TDVMCALL).
> __tdx_module_call()  - Used to communicate with the TDX module (via
> 		       TDCALL).
> 
> Also define an additional wrapper _tdx_hypercall(), which adds error
> handling support for the TDCALL failure.
> 
> The __tdx_module_call() and __tdx_hypercall() helper functions are
> implemented in assembly in a .S file.  The TDCALL ABI requires
> shuffling arguments in and out of registers, which proved to be
> awkward with inline assembly.
> 
> Just like syscalls, not all TDVMCALL use cases need to use the same
> number of argument registers. The implementation here picks the current
> worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
> than 4 arguments, there will end up being a few superfluous (cheap)
> instructions. But, this approach maximizes code reuse.
> 
> For registers used by the TDCALL instruction, please check TDX GHCI
> specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
> Interface".
> 
> Based on previous patch by Sean Christopherson.
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Looks good:

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

BTW, if you revise this again, let me have a few minutes with the
changelog.  There are, again, a few things that we should make less
clunky.  But, they aren't deal breakers.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 04/30] x86/tdx: Extend the confidential computing API to support TDX guests
  2022-03-02 14:27 ` [PATCHv5 04/30] x86/tdx: Extend the confidential computing API to support TDX guests Kirill A. Shutemov
@ 2022-03-08 20:17   ` Dave Hansen
  2022-03-09 16:01     ` [PATCHv5.1 " Kirill A. Shutemov
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2022-03-08 20:17 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/2/22 06:27, Kirill A. Shutemov wrote:
...
> Like AMD SME/SEV, TDX uses a bit in the page table entry to indicate
> encryption status of the page, but the polarity of the mask is
> opposite to AMD: if the bit is set the page is accessible to VMM.

I'd much rather this be in a code comment next to the weird-looking code
than in the changelog.

> Details about which bit in the page table entry to be used to indicate
> shared/private state can be determined by using the TDINFO TDCALL.

s/can be/are/

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/Kconfig     |  1 +
>  arch/x86/coco/core.c |  4 ++++
>  arch/x86/coco/tdx.c  | 38 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 43 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index c346d66b51fc..93e67842e369 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
>  	bool "Intel TDX (Trust Domain Extensions) - Guest Support"
>  	depends on X86_64 && CPU_SUP_INTEL
>  	depends on X86_X2APIC
> +	select ARCH_HAS_CC_PLATFORM
>  	help
>  	  Support running as a guest under Intel TDX.  Without this support,
>  	  the guest kernel can not boot or run under TDX.
> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> index fc1365dd927e..9113baebbfd2 100644
> --- a/arch/x86/coco/core.c
> +++ b/arch/x86/coco/core.c
> @@ -90,6 +90,8 @@ u64 cc_mkenc(u64 val)
>  	switch (vendor) {
>  	case CC_VENDOR_AMD:
>  		return val | cc_mask;
> +	case CC_VENDOR_INTEL:
> +		return val & ~cc_mask;
>  	default:
>  		return val;
>  	}
> @@ -100,6 +102,8 @@ u64 cc_mkdec(u64 val)
>  	switch (vendor) {
>  	case CC_VENDOR_AMD:
>  		return val & ~cc_mask;
> +	case CC_VENDOR_INTEL:
> +		return val | cc_mask;
>  	default:
>  		return val;
>  	}
> diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
> index 17365fd40ba2..912ef12e434e 100644
> --- a/arch/x86/coco/tdx.c
> +++ b/arch/x86/coco/tdx.c
> @@ -5,8 +5,12 @@
>  #define pr_fmt(fmt)     "tdx: " fmt
>  
>  #include <linux/cpufeature.h>
> +#include <asm/coco.h>
>  #include <asm/tdx.h>
>  
> +/* TDX module Call Leaf IDs */
> +#define TDX_GET_INFO			1
> +
>  /*
>   * Wrapper for standard use of __tdx_hypercall with no output aside from
>   * return code.
> @@ -25,8 +29,32 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>  	return __tdx_hypercall(&args, 0);
>  }
>  
> +static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> +				   struct tdx_module_output *out)
> +{
> +	if (__tdx_module_call(fn, rcx, rdx, r8, r9, out))
> +		panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
> +}

I really think we need to document the panic()s that we add.  It might
mean duplicating a wee bit of the text from the SEAMCALL/TDCALL
assembly, but I think it's worth it so that folks don't think this is an
over-eager panic().
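
Something like this would do -- the wording is only a sketch:

	/*
	 * __tdx_module_call() failing means the TDX module itself is
	 * misbehaving.  The guest cannot trust anything about its
	 * environment at that point, so there is nothing sane to do
	 * but panic().
	 */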

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 06/30] x86/traps: Refactor exc_general_protection()
  2022-03-02 14:27 ` [PATCHv5 06/30] x86/traps: Refactor exc_general_protection() Kirill A. Shutemov
@ 2022-03-08 20:18   ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-08 20:18 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/2/22 06:27, Kirill A. Shutemov wrote:
> TDX brings a new exception -- Virtualization Exception (#VE). Handling
> of #VE is structurally very similar to handling #GP.
> 
> Extract two helpers from exc_general_protection() that can be reused for
> handling #VE.
> 
> No functional changes.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 07/30] x86/traps: Add #VE support for TDX guest
  2022-03-02 14:27 ` [PATCHv5 07/30] x86/traps: Add #VE support for TDX guest Kirill A. Shutemov
@ 2022-03-08 20:29   ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-08 20:29 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Sean Christopherson

On 3/2/22 06:27, Kirill A. Shutemov wrote:
> Virtualization Exceptions (#VE) are delivered to TDX guests due to
> specific guest actions which may happen in either user space or the
> kernel:
> 
>  * Specific instructions (WBINVD, for example)
>  * Specific MSR accesses
>  * Specific CPUID leaf accesses
>  * Access to specific guest physical addresses
...
>  arch/x86/coco/tdx.c             | 31 +++++++++++++
>  arch/x86/include/asm/idtentry.h |  4 ++
>  arch/x86/include/asm/tdx.h      | 21 +++++++++
>  arch/x86/kernel/idt.c           |  3 ++
>  arch/x86/kernel/traps.c         | 81 +++++++++++++++++++++++++++++++++

I know it took a long time to get here, but it's really, really nice
that this ended up being all done in C without any nastiness in the
kernel to deal with things like NMIs.

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 10/30] x86/tdx: Handle CPUID via #VE
  2022-03-02 14:27 ` [PATCHv5 10/30] x86/tdx: Handle CPUID via #VE Kirill A. Shutemov
@ 2022-03-08 20:33   ` Dave Hansen
  2022-03-09 16:15     ` [PATCH] " Kirill A. Shutemov
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2022-03-08 20:33 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/2/22 06:27, Kirill A. Shutemov wrote:
> In TDX guests, most CPUID leaf/sub-leaf combinations are virtualized
> by the TDX module while some trigger #VE.
> 
> Implement the #VE handling for EXIT_REASON_CPUID by handing it through
> the hypercall, which in turn lets the TDX module handle it by invoking
> the host VMM.
> 
> More details on CPUID Virtualization can be found in the TDX module
> specification, the section titled "CPUID Virtualization".
> 
> Note that the VMM that handles the hypercall is not trusted. It can
> return data that may steer the guest kernel in the wrong direction. Only
> allow the VMM to control the range reserved for hypervisor communication.
> Return all-zeros for any CPUID outside the range.
> 
> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

It would be nice to also mention the implications of the all-zero CPUID
policy.  I'll plan to add a sentence or two when we apply this.

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO
  2022-03-02 14:27 ` [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO Kirill A. Shutemov
@ 2022-03-08 21:26   ` Dave Hansen
  2022-03-10  0:51     ` Kirill A. Shutemov
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2022-03-08 21:26 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/2/22 06:27, Kirill A. Shutemov wrote:
> In non-TDX VMs, MMIO is implemented by providing the guest a mapping
> which will cause a VMEXIT on access and then the VMM emulating the
> instruction that caused the VMEXIT. That's not possible for TDX VM.
> 
> To emulate an instruction an emulator needs two things:
> 
>   - R/W access to the register file to read/modify instruction arguments
>     and see RIP of the faulted instruction.
> 
>   - Read access to memory where instruction is placed to see what to
>     emulate. In this case it is guest kernel text.
> 
> Both of them are not available to VMM in TDX environment:
> 
>   - Register file is never exposed to VMM. When a TD exits to the module,
>     it saves registers into the state-save area allocated for that TD.
>     The module then scrubs these registers before returning execution
>     control to the VMM, to help prevent leakage of TD state.
> 
>   - Memory is encrypted a TD-private key. The CPU disallows software
>     other than the TDX module and TDs from making memory accesses using
>     the private key.

Memory encryption has zero to do with this.  The TDX isolation
mechanisms are totally discrete from memory encryption, although they
are "neighbors" of sorts.

> In TDX the MMIO regions are instead configured by VMM to trigger a #VE
> exception in the guest.
> 
> Add #VE handling that emulates the MMIO instruction inside the guest and
> converts it into a controlled hypercall to the host.
> 
> MMIO addresses can be used with any CPU instruction that accesses
> memory. Address only MMIO accesses done via io.h helpers, such as
> 'readl()' or 'writeq()'.
> 
> Any CPU instruction that accesses memory can also be used to access
> MMIO.  However, by convention, MMIO access are typically performed via
> io.h helpers such as 'readl()' or 'writeq()'.
> 
> The io.h helpers intentionally use a limited set of instructions when
> accessing MMIO.  This known, limited set of instructions makes MMIO
> instruction decoding and emulation feasible in KVM hosts and SEV guests
> today.
> 
> MMIO accesses are performed without the io.h helpers are at the mercy of

		^ s/are//

> the compiler.  Compilers can and will generate a much more broad set of
> instructions which can not practically be decoded and emulated.  TDX
> guests will oops if they encounter one of these decoding failures.
> 
> This means that TDX guests *must* use the io.h helpers to access MMIO.
> 
> This requirement is not new.  Both KVM hosts and AMD SEV guests have the
> same limitations on MMIO access.
> 
> === Potential alternative approaches ===
> 
> == Paravirtualizing all MMIO ==
> 
> An alternative to letting MMIO induce a #VE exception is to avoid
> the #VE in the first place. Similar to the port I/O case, it is
> theoretically possible to paravirtualize MMIO accesses.
> 
> Like the exception-based approach offered here, a fully paravirtualized
> approach would be limited to MMIO users that leverage common
> infrastructure like the io.h macros.
> 
> However, any paravirtual approach would be patching approximately 120k
> call sites. Any paravirtual approach would need to replace a bare memory
> access instruction with (at least) a function call. With a conservative
> overhead estimation of 5 bytes per call site (CALL instruction),
> it leads to bloating code by 600k.
> 
> Many drivers will never be used in the TDX environment and the bloat
> cannot be justified.
> 
> == Patching TDX drivers ==
> 
> Rather than touching the entire kernel, it might also be possible to
> just go after drivers that use MMIO in TDX guests.  Right now, that's
> limited only to virtio and some x86-specific drivers.
> 
> All virtio MMIO appears to be done through a single function, which
> makes virtio eminently easy to patch.
> 
> This approach will be adopted in the future, removing the bulk of
> MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.

This still doesn't *quite* do it for me as a justification.  Why can't
the non-virtio cases be converted as well?  Why doesn't the "patching
MMIO sites" approach work for x86 code too?

You really need to convince us that *this* approach will be required
forever.

> diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
> index d00b367f8052..e6163e7e3247 100644
> --- a/arch/x86/coco/tdx.c
> +++ b/arch/x86/coco/tdx.c
> @@ -8,11 +8,17 @@
>  #include <asm/coco.h>
>  #include <asm/tdx.h>
>  #include <asm/vmx.h>
> +#include <asm/insn.h>
> +#include <asm/insn-eval.h>
>  
>  /* TDX module Call Leaf IDs */
>  #define TDX_GET_INFO			1
>  #define TDX_GET_VEINFO			3
>  
> +/* MMIO direction */
> +#define EPT_READ	0
> +#define EPT_WRITE	1
> +
>  /*
>   * Wrapper for standard use of __tdx_hypercall with no output aside from
>   * return code.
> @@ -200,6 +206,112 @@ static bool handle_cpuid(struct pt_regs *regs)
>  	return true;
>  }
>  
> +static bool mmio_read(int size, unsigned long addr, unsigned long *val)
> +{
> +	struct tdx_hypercall_args args = {
> +		.r10 = TDX_HYPERCALL_STANDARD,
> +		.r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
> +		.r12 = size,
> +		.r13 = EPT_READ,
> +		.r14 = addr,
> +		.r15 = *val,
> +	};
> +
> +	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
> +		return false;
> +	*val = args.r11;
> +	return true;
> +}
> +
> +static bool mmio_write(int size, unsigned long addr, unsigned long val)
> +{
> +	return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION), size,
> +			       EPT_WRITE, addr, val);
> +}
> +
> +static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
> +	char buffer[MAX_INSN_SIZE];
> +	unsigned long *reg, val;
> +	struct insn insn = {};
> +	enum mmio_type mmio;
> +	int size, extend_size;
> +	u8 extend_val = 0;
> +
> +	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
> +		return false;
> +
> +	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
> +		return false;
> +
> +	mmio = insn_decode_mmio(&insn, &size);
> +	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
> +		return false;
> +
> +	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
> +		reg = insn_get_modrm_reg_ptr(&insn, regs);
> +		if (!reg)
> +			return false;
> +	}
> +
> +	ve->instr_len = insn.length;
> +
> +	switch (mmio) {
> +	case MMIO_WRITE:
> +		memcpy(&val, reg, size);
> +		return mmio_write(size, ve->gpa, val);
> +	case MMIO_WRITE_IMM:
> +		val = insn.immediate.value;
> +		return mmio_write(size, ve->gpa, val);
> +	case MMIO_READ:
> +	case MMIO_READ_ZERO_EXTEND:
> +	case MMIO_READ_SIGN_EXTEND:
> +		break;
> +	case MMIO_MOVS:
> +	case MMIO_DECODE_FAILED:
> +		/*
> +		 * MMIO was accessed with an instruction that could not be
> +		 * decoded or handled properly. It was likely not using io.h
> +		 * helpers or accessed MMIO accidentally.
> +		 */
> +		return false;
> +	default:
> +		/* Unknown insn_decode_mmio() decode value? */
> +		BUG();
> +	}

BUG()s are bad.  The set of insn_decode_mmio() return codes is known at
compile time.  If we're really on the lookout for unknown values, why
not just:

	BUILD_BUG_ON(NR_MMIO_TYPES != 6); // or whatever
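
For that to work, enum mmio_type in <asm/insn-eval.h> would need to grow a
trailing NR_MMIO_TYPES sentinel -- it has no such value today, so this is
only a sketch:

	/* enum mmio_type currently spans MMIO_DECODE_FAILED..MMIO_MOVS */
	BUILD_BUG_ON(NR_MMIO_TYPES != 7);

That breaks the build and forces a re-audit of this handler whenever
insn_decode_mmio() learns a new return value.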

Also, there are *lots* of ways for this function to just fall over and
fail.  Why does this particular failure mode deserve a BUG()?

Is there a reason a BUG() is better than returning failure which
presumably sets off the #GP-like logic?

Also, now that I've read this a few times, I've been confused by the
same thing a few times.  This is handling instructions that might read
or write or do both, correct?

Should that be made explicit in a function comment?
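
Something along these lines, say (wording is only a sketch):

	/*
	 * Emulate a single MMIO instruction: decode the faulting
	 * instruction, perform the read or write via hypercalls and,
	 * for reads, fix up the destination register.
	 */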

> +	/* Handle reads */
> +	if (!mmio_read(size, ve->gpa, &val))
> +		return false;
> +
> +	switch (mmio) {
> +	case MMIO_READ:
> +		/* Zero-extend for 32-bit operation */
> +		extend_size = size == 4 ? sizeof(*reg) : 0;
> +		break;
> +	case MMIO_READ_ZERO_EXTEND:
> +		/* Zero extend based on operand size */
> +		extend_size = insn.opnd_bytes;
> +		break;
> +	case MMIO_READ_SIGN_EXTEND:
> +		/* Sign extend based on operand size */
> +		extend_size = insn.opnd_bytes;
> +		if (size == 1 && val & BIT(7))
> +			extend_val = 0xFF;
> +		else if (size > 1 && val & BIT(15))
> +			extend_val = 0xFF;
> +		break;
> +	default:
> +		/* All other cases has to be covered with the first switch() */
> +		BUG();
> +	}
> +
> +	if (extend_size)
> +		memset(reg, extend_val, extend_size);
> +	memcpy(reg, &val, size);
> +	return true;
> +}
> +
>  void tdx_get_ve_info(struct ve_info *ve)
>  {
>  	struct tdx_module_output out;
> @@ -247,6 +359,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
>  		return write_msr(regs);
>  	case EXIT_REASON_CPUID:
>  		return handle_cpuid(regs);
> +	case EXIT_REASON_EPT_VIOLATION:
> +		return handle_mmio(regs, ve);
>  	default:
>  		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
>  		return false;


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 17/30] x86/tdx: Port I/O: add runtime hypercalls
  2022-03-02 14:27 ` [PATCHv5 17/30] x86/tdx: Port I/O: add runtime hypercalls Kirill A. Shutemov
@ 2022-03-08 21:30   ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-08 21:30 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/2/22 06:27, Kirill A. Shutemov wrote:
> This series has no special handling for ioperm(). Users will be able
> to successfully request I/O permissions but will induce a #VE on
> their> first I/O instruction.

How will this be visible to users or user applications?

> +static bool handle_io(struct pt_regs *regs, u32 exit_qual)
> +{
> +	bool in;
> +	int size, port;
> +
> +	if (VE_IS_IO_STRING(exit_qual))
> +		return false;
> +
> +	in   = VE_IS_IO_IN(exit_qual);
> +	size = VE_GET_IO_SIZE(exit_qual);
> +	port = VE_GET_PORT_NUM(exit_qual);
> +
> +
> +	if (in)
> +		return handle_in(regs, size, port);
> +	else
> +		return handle_out(regs, size, port);
> +}

Some extra whitespace snuck in there.

With the question answered and whitespace fixed:

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 22/30] x86/boot: Set CR0.NE early and keep it set during the boot
  2022-03-02 14:27 ` [PATCHv5 22/30] x86/boot: Set CR0.NE early and keep it set during the boot Kirill A. Shutemov
@ 2022-03-08 21:37   ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-08 21:37 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/2/22 06:27, Kirill A. Shutemov wrote:
> TDX guest requires CR0.NE to be set. Clearing the bit triggers #GP(0).
> 
> If CR0.NE is 0, the MS-DOS compatibility mode for handling floating-point
> exceptions is selected. In this mode, the software exception handler for
> floating-point exceptions is invoked externally using the processor’s
> FERR#, INTR, and IGNNE# pins.
> 
> Using FERR# and IGNNE# to handle floating-point exception is deprecated.
> CR0.NE=0 also limits newer processors to operate with one logical
> processor active.
> 
> Kernel uses CR0_STATE constant to initialize CR0. It has NE bit set.
> But during early boot kernel has more ad-hoc approach to setting bit
> in the register.

This walks right up to the problem but never actually comes out and says
what the problem is:

	During some of this ad-hoc manipulation, CR0.NE is cleared.
	This causes a #GP in TDX guests and makes it die in early boot.

> Make CR0 initialization consistent, deriving the initial value of CR0
> from CR0_STATE.

... and the solution:

	Since CR0_STATE always has CR0.NE=1, this ensures that CR0.NE is
	never 0 and avoids the #GP.

With the fixed changelog:

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 25/30] x86/tdx: Make pages shared in ioremap()
  2022-03-02 14:28 ` [PATCHv5 25/30] x86/tdx: Make pages shared in ioremap() Kirill A. Shutemov
@ 2022-03-08 22:02   ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-08 22:02 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/2/22 06:28, Kirill A. Shutemov wrote:
> In TDX guests, guest memory is protected from host access. If a guest
> performs I/O, it needs to explicitly share the I/O memory with the host.
> 
> Make all ioremap()ed pages that are not backed by normal memory
> (IORES_DESC_NONE or IORES_DESC_RESERVED) mapped as shared.
> 
> Since TDX memory encryption support is similar to AMD SEV architecture,
> reuse the infrastructure from AMD SEV code.
> 
> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/mm/ioremap.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index 026031b3b782..a5d4ec1afca2 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -242,10 +242,15 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
>  	 * If the page being mapped is in memory and SEV is active then
>  	 * make sure the memory encryption attribute is enabled in the
>  	 * resulting mapping.
> +	 * In TDX guests, memory is marked private by default. If encryption
> +	 * is not requested (using encrypted), explicitly set decrypt
> +	 * attribute in all IOREMAPPED memory.
>  	 */

Nit: in this context, nobody knows what "private" means.

I'd probably just say this in the changelog:

	The permissions in PAGE_KERNEL_IO already work for "decrypted"
	memory on AMD SEV/SME systems.  That means that they have no
	need to make a pgprot_decrypted() call.

	TDX guests, on the other hand, _need_ change to PAGE_KERNEL_IO
	for "decrypted" mappings.  Add a pgprot_decrypted() for TDX.

I'm not sure you need a code comment.  There's really nothing that
mentions TDX in the code being commented.  If it needs clarification,
I'd do it behind the pgprot*() helpers.
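
For reference, the mechanics would then boil down to something like this
in __ioremap_caller() -- a sketch of the end result, not the exact hunk:

	prot = PAGE_KERNEL_IO;
	if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
		prot = pgprot_encrypted(prot);
	else
		prot = pgprot_decrypted(prot);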

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCHv5.1 04/30] x86/tdx: Extend the confidential computing API to support TDX guests
  2022-03-08 20:17   ` Dave Hansen
@ 2022-03-09 16:01     ` Kirill A. Shutemov
  2022-03-09 18:36       ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-09 16:01 UTC (permalink / raw)
  To: dave.hansen
  Cc: aarcange, ak, bp, brijesh.singh, dan.j.williams, david, hpa,
	jgross, jmattson, joro, jpoimboe, kirill.shutemov, knsathya,
	linux-kernel, luto, mingo, pbonzini, peterz,
	sathyanarayanan.kuppuswamy, sdeep, seanjc, tglx, thomas.lendacky,
	tony.luck, vkuznets, wanpengli, x86

Confidential Computing (CC) features (like string I/O unroll support,
memory encryption/decryption support, etc) are conditionally enabled in
the kernel using the cc_platform_has() API. Since TDX guests also need to
use these CC features, extend the cc_platform_has() API and add support
for TDX guest-specific CC attributes.

The CC API also provides an interface to deal with the encryption mask.
Extend it to cover TDX.

Which bit in the page table entry indicates shared/private state is
determined by using the TDINFO TDCALL.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig     |  1 +
 arch/x86/coco/core.c | 12 ++++++++++++
 arch/x86/coco/tdx.c  | 42 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c346d66b51fc..93e67842e369 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
 	bool "Intel TDX (Trust Domain Extensions) - Guest Support"
 	depends on X86_64 && CPU_SUP_INTEL
 	depends on X86_X2APIC
+	select ARCH_HAS_CC_PLATFORM
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index fc1365dd927e..6529db059938 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -87,9 +87,18 @@ EXPORT_SYMBOL_GPL(cc_platform_has);
 
 u64 cc_mkenc(u64 val)
 {
+	/*
+	 * Both AMD and Intel use a bit in page table to indicate encryption
+	 * status of the page.
+	 *
+	 * - for AMD, bit *set* means the page is encrypted
+	 * - for Intel *clear* means encrypted.
+	 */
 	switch (vendor) {
 	case CC_VENDOR_AMD:
 		return val | cc_mask;
+	case CC_VENDOR_INTEL:
+		return val & ~cc_mask;
 	default:
 		return val;
 	}
@@ -97,9 +106,12 @@ u64 cc_mkenc(u64 val)
 
 u64 cc_mkdec(u64 val)
 {
+	/* See comment in cc_mkenc() */
 	switch (vendor) {
 	case CC_VENDOR_AMD:
 		return val & ~cc_mask;
+	case CC_VENDOR_INTEL:
+		return val | cc_mask;
 	default:
 		return val;
 	}
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index d1ce35c1ac18..38b5a56f007f 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -5,8 +5,12 @@
 #define pr_fmt(fmt)     "tdx: " fmt
 
 #include <linux/cpufeature.h>
+#include <asm/coco.h>
 #include <asm/tdx.h>
 
+/* TDX module Call Leaf IDs */
+#define TDX_GET_INFO			1
+
 /*
  * Wrapper for standard use of __tdx_hypercall with no output aside from
  * return code.
@@ -25,8 +29,36 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
 	return __tdx_hypercall(&args, 0);
 }
 
+/*
+ * Wrapper for __tdx_module_call() for cases when the call isn't supposed to
+ * fail. Panic if the call fails.
+ */
+static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+				   struct tdx_module_output *out)
+{
+	if (__tdx_module_call(fn, rcx, rdx, r8, r9, out))
+		panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
+}
+
+static void get_info(unsigned int *gpa_width)
+{
+	struct tdx_module_output out;
+
+	/*
+	 * TDINFO TDX module call is used to get the TD execution environment
+	 * information like GPA width, number of available vcpus, debug mode
+	 * information, etc. More details about the ABI can be found in TDX
+	 * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
+	 * [TDG.VP.INFO].
+	 */
+	tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
+
+	*gpa_width = out.rcx & GENMASK(5, 0);
+}
+
 void __init tdx_early_init(void)
 {
+	unsigned int gpa_width;
 	u32 eax, sig[3];
 
 	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
@@ -36,5 +68,15 @@ void __init tdx_early_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+	get_info(&gpa_width);
+
+	cc_set_vendor(CC_VENDOR_INTEL);
+
+	/*
+	 * The highest bit of a guest physical address is the "sharing" bit.
+	 * Set it for shared pages and clear it for private pages.
+	 */
+	cc_set_mask(BIT_ULL(gpa_width - 1));
+
 	pr_info("Guest detected\n");
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH] x86/tdx: Handle CPUID via #VE
  2022-03-08 20:33   ` Dave Hansen
@ 2022-03-09 16:15     ` Kirill A. Shutemov
  0 siblings, 0 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-09 16:15 UTC (permalink / raw)
  To: dave.hansen
  Cc: aarcange, ak, bp, brijesh.singh, dan.j.williams, david, hpa,
	jgross, jmattson, joro, jpoimboe, kirill.shutemov, knsathya,
	linux-kernel, luto, mingo, pbonzini, peterz,
	sathyanarayanan.kuppuswamy, sdeep, seanjc, tglx, thomas.lendacky,
	tony.luck, vkuznets, wanpengli, x86, Dave Hansen

In TDX guests, most CPUID leaf/sub-leaf combinations are virtualized
by the TDX module while some trigger #VE.

Implement the #VE handling for EXIT_REASON_CPUID by handing it through
the hypercall, which in turn lets the TDX module handle it by invoking
the host VMM.

More details on CPUID Virtualization can be found in the TDX module
specification, the section titled "CPUID Virtualization".

Note that the VMM that handles the hypercall is not trusted. It can return
data that may steer the guest kernel in the wrong direction. Only allow the
VMM to control the range reserved for hypervisor communication.

Return all-zeros for any CPUID outside the hypervisor range. This matches
the CPU behaviour for an unsupported leaf.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 arch/x86/coco/tdx.c | 58 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index a2f19c78583a..3d468a2b9ec6 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -163,6 +163,48 @@ static bool write_msr(struct pt_regs *regs)
 	return !__tdx_hypercall(&args, 0);
 }
 
+static bool handle_cpuid(struct pt_regs *regs)
+{
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = hcall_func(EXIT_REASON_CPUID),
+		.r12 = regs->ax,
+		.r13 = regs->cx,
+	};
+
+	/*
+	 * Only allow VMM to control range reserved for hypervisor
+	 * communication.
+	 *
+	 * Return all-zeros for any CPUID outside the range. It matches CPU
+	 * behaviour for non-supported leaf.
+	 */
+	if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
+		regs->ax = regs->bx = regs->cx = regs->dx = 0;
+		return true;
+	}
+
+	/*
+	 * Emulate the CPUID instruction via a hypercall. More info about
+	 * ABI can be found in TDX Guest-Host-Communication Interface
+	 * (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
+	 */
+	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
+		return false;
+
+	/*
+	 * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
+	 * EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
+	 * So copy the register contents back to pt_regs.
+	 */
+	regs->ax = args.r12;
+	regs->bx = args.r13;
+	regs->cx = args.r14;
+	regs->dx = args.r15;
+
+	return true;
+}
+
 void tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -186,6 +228,18 @@ void tdx_get_ve_info(struct ve_info *ve)
 	ve->instr_info  = upper_32_bits(out.r10);
 }
 
+/* Handle the user initiated #VE */
+static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
+{
+	switch (ve->exit_reason) {
+	case EXIT_REASON_CPUID:
+		return handle_cpuid(regs);
+	default:
+		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+		return false;
+	}
+}
+
 /* Handle the kernel #VE */
 static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 {
@@ -196,6 +250,8 @@ static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 		return read_msr(regs);
 	case EXIT_REASON_MSR_WRITE:
 		return write_msr(regs);
+	case EXIT_REASON_CPUID:
+		return handle_cpuid(regs);
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		return false;
@@ -207,7 +263,7 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
 	bool ret;
 
 	if (user_mode(regs))
-		ret = false;
+		ret = virt_exception_user(regs, ve);
 	else
 		ret = virt_exception_kernel(regs, ve);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCHv5.1 01/30] x86/tdx: Detect running as a TDX guest in early boot
  2022-03-07 22:24         ` [PATCHv5.1 " Kirill A. Shutemov
@ 2022-03-09 18:22           ` Borislav Petkov
  0 siblings, 0 replies; 84+ messages in thread
From: Borislav Petkov @ 2022-03-09 18:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: aarcange, ak, brijesh.singh, dan.j.williams, dave.hansen,
	dave.hansen, david, hpa, jgross, jmattson, joro, jpoimboe,
	knsathya, linux-kernel, luto, mingo, pbonzini, peterz,
	sathyanarayanan.kuppuswamy, sdeep, seanjc, tglx, thomas.lendacky,
	tony.luck, vkuznets, wanpengli, x86

On Tue, Mar 08, 2022 at 01:24:56AM +0300, Kirill A. Shutemov wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> In preparation for extending the cc_platform_has() API to support TDX guests,
> use CPUID instruction to detect support for TDX guests in the early
> boot code (via tdx_early_init()). Since copy_bootdata() is the first
> user of cc_platform_has() API, detect the TDX guest status before it.
> 
> Define a synthetic feature flag (X86_FEATURE_TDX_GUEST) and set this
> bit in a valid TDX guest platform.
> 
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
>  v5.1:
>    - Drop BUILD_BUG_ON()
> ---
>  arch/x86/Kconfig                         | 12 ++++++++++++
>  arch/x86/coco/Makefile                   |  2 ++
>  arch/x86/coco/tdx.c                      | 22 ++++++++++++++++++++++
>  arch/x86/include/asm/cpufeatures.h       |  1 +
>  arch/x86/include/asm/disabled-features.h |  8 +++++++-
>  arch/x86/include/asm/tdx.h               | 21 +++++++++++++++++++++
>  arch/x86/kernel/head64.c                 |  4 ++++
>  7 files changed, 69 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/coco/tdx.c
>  create mode 100644 arch/x86/include/asm/tdx.h

I don't know how many versions of this patch I've reviewed by now. Oh
well, finally:

Reviewed-by: Borislav Petkov <bp@suse.de>

:-)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5.1 04/30] x86/tdx: Extend the confidential computing API to support TDX guests
  2022-03-09 16:01     ` [PATCHv5.1 " Kirill A. Shutemov
@ 2022-03-09 18:36       ` Dave Hansen
  2022-03-09 23:51         ` [PATCHv5.2 " Kirill A. Shutemov
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2022-03-09 18:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: aarcange, ak, bp, brijesh.singh, dan.j.williams, david, hpa,
	jgross, jmattson, joro, jpoimboe, knsathya, linux-kernel, luto,
	mingo, pbonzini, peterz, sathyanarayanan.kuppuswamy, sdeep,
	seanjc, tglx, thomas.lendacky, tony.luck, vkuznets, wanpengli,
	x86

On 3/9/22 08:01, Kirill A. Shutemov wrote:
> +/*
> + * Wrapper for __tdx_module_call() for cases when the call doesn't suppose to
> + * fail. Panic if the call fails.
> + */
> +static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> +				   struct tdx_module_output *out)
> +{
> +	if (__tdx_module_call(fn, rcx, rdx, r8, r9, out))
> +		panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
> +}

That comment didn't do much for me.  I know it's a wrapper.  I know it
panics() if the call returns a failure.  That's what the code *does*.  I
want a comment to tell me *why* it does that.

I _think_ I may have been getting this confused with the TDVMCALL mechanism.

All TDVMCALLs that return with rax!=0 are fatal, we jump right to a ud2
instruction.  A __tdx_module_call() (via TDCALL) with rax!=0 doesn't
*have* to be fatal.  But, this establishes a policy that all failing
TDCALLs via tdx_module_call() *ARE* fatal.

How about this for a comment?

/*
 * Used for TDX guests to make calls directly to the TD module.  This
 * should only be used for calls that have no legitimate reason to fail
 * or where the kernel can not survive the call failing.
 */

That tells me a *LOT*: This is a guest -> TD module thing.  Not a host
thing, not a hypercall.  And, no, the naming isn't good enough to tell me
that.  Also, it gives me advice.  It tells me when I should use this
function.  If I look at the call site, it even makes sense.  A guest
can't even build a sane PTE without this call succeeding.  Of *COURSE*
we panic() if the call fails.

You could even call this information out in the comment in get_info():


...
	 * The GPA width that comes out of this call is critical.  TDX
	 * guests can not meaningfully run without it.
	 */


Then it all kinda fits together.  Oh, this panic() is awfully harsh.
Oh, it's only supposed to be used for important things that the guest
really needs.  Then there's a comment about why it needs it so badly.

Otherwise the panic() just looks superfluous and mean.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 26/30] x86/mm/cpa: Add support for TDX shared memory
  2022-03-02 14:28 ` [PATCHv5 26/30] x86/mm/cpa: Add support for TDX shared memory Kirill A. Shutemov
@ 2022-03-09 19:44   ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-09 19:44 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

> +static bool tdx_tlb_flush_required(bool enc)
> +{
> +	/*
> +	 * TDX guest is responsible for flushing caches on private->shared

Caches?  In a "tlb" flushing function?  Did you mean paging structure
caches or CPU caches?

> +	 * transition. VMM is responsible for flushing on shared->private.
> +	 */
> +	return !enc;
> +}

It's also pretty nasty to have that argument called 'enc' when there's
no encryption in the comment.  That at least needs to be mentioned.

I'd also appreciate a mention somewhere of what the security/stability
model is here.  What if a malicious VMM doesn't flush on a
shared->private transition?  What is the fallout?  Who gets hurt?

> +static bool tdx_cache_flush_required(void)
> +{
> +	return true;
> +}

This leaves me totally in the dark.  I frankly don't know what
enc_tlb_flush_required does without looking, but I also don't know your
intent.  A one-liner about intent would be nice to ensure it matches
what enc_tlb_flush_required does.
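
Even a one-liner would do.  Something like this (a sketch; the
reasoning in the comment is my guess at your intent, so correct it as
needed):

	static bool tdx_cache_flush_required(void)
	{
		/*
		 * AMD can skip the flush in some configurations where
		 * the hardware enforces cache coherency.  TDX has no
		 * such exception: always flush.
		 */
		return true;
	}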

> +static bool accept_page(phys_addr_t gpa, enum pg_level pg_level)
> +{
> +	/*
> +	 * Pass the page physical address to the TDX module to accept the
> +	 * pending, private page.
> +	 *
> +	 * Bits 2:0 of GPA encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
> +	 */
> +	switch (pg_level) {
> +	case PG_LEVEL_4K:
> +		break;
> +	case PG_LEVEL_2M:
> +		gpa |= 1;
> +		break;
> +	case PG_LEVEL_1G:
> +		gpa |= 2;
> +		break;
> +	default:
> +		return false;
> +	}

Just a style thing.  I'd much rather this be something like:

	u8 page_size;
	u64 tdcall_rcx;

	switch (pg_level) {
	case PG_LEVEL_4K:
		page_size = 0;
		break;
	case PG_LEVEL_2M:
		page_size = 1;
		break;
	case PG_LEVEL_1G:
		page_size = 2;
		break;
	default:
		return false;
	}

	tdcall_rcx = gpa | page_size;

BTW, the spec from August 2021 says these bits are "either 0 (4kb) or 1
(2MB)".  No mention of 1G.


> +	return !__tdx_module_call(TDX_ACCEPT_PAGE, gpa, 0, 0, 0, NULL);
> +}

The rcx register in the TDX_ACCEPT_PAGE call is *NOT* the gpa.  It
contains the gpa in a few of its bits.  Doing it this way ^ makes it
painfully clear that the argument is not solely a gpa.

> +/*
> + * Inform the VMM of the guest's intent for this physical page: shared with
> + * the VMM or private to the guest.  The VMM is expected to change its mapping
> + * of the page in response.
> + */
> +static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
> +{
> +	phys_addr_t start = __pa(vaddr);
> +	phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
> +
> +	if (!enc) {
> +		start |= cc_mkdec(0);
> +		end |= cc_mkdec(0);
> +	}
> +
> +	/*
> +	 * Notify the VMM about page mapping conversion. More info about ABI
> +	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
> +	 * section "TDG.VP.VMCALL<MapGPA>"
> +	 */
> +	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
> +		return false;

This is really confusing.  "start" and "end" are physical addresses and
you're doing physical address math on them.  But they also have some
other bits encoded in them.

I *guess* that works.  If you've set the same bits in both, then when
you subtract them, the bits cancel out.  But, it's horribly confusing.

Look how much more sane this is to read if we do a few things:

	phys_addr_t start = __pa(vaddr);
	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
			^ add vertical alignment
	phys_addr_t len_bytes = end - start;
	bool private = enc;

	if (!private) {
		/* Set the shared (decrypted) bits: */
		    ^ Note that we're helping the reader impedance-match
	 		between 'enc' and shared/private
		start |= cc_mkdec(0);
		end   |= cc_mkdec(0);
		    ^ more vertical alignment
	}

	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, len_bytes, 0, 0))
		return false;


> +	/* private->shared conversion  requires only MapGPA call */
> +	if (!enc)
> +		return true;
> +
> +	/*
> +	 * For shared->private conversion, accept the page using
> +	 * TDX_ACCEPT_PAGE TDX module call.
> +	 */
> +	while (start < end) {
> +		/* Try if 1G page accept is possible */
> +		if (!(start & ~PUD_MASK) && end - start >= PUD_SIZE &&
> +		    accept_page(start, PG_LEVEL_1G)) {
> +			start += PUD_SIZE;
> +			continue;
> +		}

This is rather ugly.  Why not just do a helper:

static int try_accept_one(phys_addr_t *start, unsigned long len,
			  unsigned long accept_size)
{
	int ret;

	if (!IS_ALIGNED(*start, accept_size))
		return -ESOMETHING;
	if (len < accept_size)
		return -ESOMETHING;

	ret = accept_page(*start, size_to_level(accept_size));

	if (!ret)
		*start += accept_size;

	return ret;
}

Then the loop becomes actually readable:

	while (start < end) {
		len = end - start;

		/* Try larger accepts first because... */

		ret = try_accept_one(&start, len, PUD_SIZE);
		if (!ret)
			continue;

		ret = try_accept_one(&start, len, PMD_SIZE);
		if (!ret)
			continue;

		ret = try_accept_one(&start, len, PTE_SIZE);
		if (ret)
			return false;
	}
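
(size_to_level() here is made up; it would just be the inverse of the
switch in accept_page(), something like:

	static enum pg_level size_to_level(unsigned long accept_size)
	{
		switch (accept_size) {
		case PUD_SIZE:
			return PG_LEVEL_1G;
		case PMD_SIZE:
			return PG_LEVEL_2M;
		default:
			return PG_LEVEL_4K;
		}
	}

completely untested, of course.)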



> +
> +		/* Try if 2M page accept is possible */
> +		if (!(start & ~PMD_MASK) && end - start >= PMD_SIZE &&
> +		    accept_page(start, PG_LEVEL_2M)) {
> +			start += PMD_SIZE;
> +			continue;
> +		}
> +
> +		if (!accept_page(start, PG_LEVEL_4K))
> +			return false;
> +		start += PAGE_SIZE;
> +	}
> +
> +	return true;
> +}
> +
>  void __init tdx_early_init(void)
>  {
>  	unsigned int gpa_width;
> @@ -526,5 +623,9 @@ void __init tdx_early_init(void)
>  	 */
>  	cc_set_mask(BIT_ULL(gpa_width - 1));
>  
> +	x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
> +	x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
> +	x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;

Could you double-check for vertical alignment opportunities in this
patch?  This is a place where two spaces can at least tell you quickly
that this is all TDX-specific:

	x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
	x86_platform.guest.enc_tlb_flush_required   = tdx_tlb_flush_required;
	x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;

>  	pr_info("Guest detected\n");
>  }
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 1c3cb952fa2a..080f21171b27 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -1308,7 +1308,7 @@ static void ve_raise_fault(struct pt_regs *regs, long error_code)
>   *
>   * In the settings that Linux will run in, virtualization exceptions are
>   * never generated on accesses to normal, TD-private memory that has been
> - * accepted.
> + * accepted (by BIOS or with tdx_enc_status_changed()).
>   *
>   * Syscall entry code has a critical window where the kernel stack is not
>   * yet set up. Any exception in this window leads to hard to debug issues


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 27/30] x86/kvm: Use bounce buffers for TD guest
  2022-03-02 14:28 ` [PATCHv5 27/30] x86/kvm: Use bounce buffers for TD guest Kirill A. Shutemov
@ 2022-03-09 20:07   ` Dave Hansen
  2022-03-10 14:29     ` Tom Lendacky
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2022-03-09 20:07 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/2/22 06:28, Kirill A. Shutemov wrote:
> --- a/arch/x86/coco/tdx.c
> +++ b/arch/x86/coco/tdx.c
> @@ -5,6 +5,7 @@
>  #define pr_fmt(fmt)     "tdx: " fmt
>  
>  #include <linux/cpufeature.h>
> +#include <linux/swiotlb.h>
>  #include <asm/coco.h>
>  #include <asm/tdx.h>
>  #include <asm/vmx.h>
> @@ -627,5 +628,7 @@ void __init tdx_early_init(void)
>  	x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
>  	x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
>  
> +	swiotlb_force = SWIOTLB_FORCE;
> +
>  	pr_info("Guest detected\n");
>  }

AMD currently does:

        if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
                swiotlb_force = SWIOTLB_FORCE;

which somewhat begs the question of why we can't do the

	swiotlb_force = SWIOTLB_FORCE;

thing in:

void __init mem_encrypt_init(void)
{
        if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))
                return;

/// Here

I recall there being a reason for this.  But I don't see any mention in
the changelog.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 28/30] x86/tdx: ioapic: Add shared bit for IOAPIC base address
  2022-03-02 14:28 ` [PATCHv5 28/30] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kirill A. Shutemov
@ 2022-03-09 20:39   ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-09 20:39 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel, Isaku Yamahata

On 3/2/22 06:28, Kirill A. Shutemov wrote:
> +static void io_apic_set_fixmap_nocache(enum fixed_addresses idx,
> +				       phys_addr_t phys)
> +{
> +	pgprot_t flags = FIXMAP_PAGE_NOCACHE;
> +
> +	flags = pgprot_decrypted(flags);
> +	__set_fixmap(idx, phys, flags);
> +}

This is only used by the "io_apic".  No need to add the "_nocache".  Maybe:

static void io_apic_set_fixmap(enum fixed_addresses idx, ...
{
	pgprot_t flags = FIXMAP_PAGE_NOCACHE;

	/*
	 * Ensure fixmaps for IOAPIC MMIO respect memory
	 * encryption pgprot bits, just like normal ioremap():
	 */
	flags = pgprot_decrypted(flags);

	__set_fixmap(idx, phys, flags);
}

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 29/30] ACPICA: Avoid cache flush inside virtual machines
  2022-03-02 14:28 ` [PATCHv5 29/30] ACPICA: Avoid cache flush inside virtual machines Kirill A. Shutemov
  2022-03-02 16:13   ` Dan Williams
@ 2022-03-09 20:56   ` Dave Hansen
  1 sibling, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-09 20:56 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/2/22 06:28, Kirill A. Shutemov wrote:
> WBINVD is not supported in TDX guest and triggers #VE. There's no robust
> way to emulate it. The kernel has to avoid it.

Not really.  It could just ignore any use of WBINVD.  That's what
hypervisors mostly do today.

> ACPI_FLUSH_CPU_CACHE() flushes caches usign WBINVD on entering sleep

					 ^ using
			
> states. It is required to prevent data loss.
> 
> While running inside virtual machine, the kernel can bypass cache
> flushing. Changing sleep state in a virtual machine doesn't affect the
> host system sleep state and cannot lead to data loss.

How's this?

Before entering sleep states, the ACPI code flushes caches to prevent
data loss using the WBINVD instruction.  This mechanism is required on
bare metal.

But, any use of WBINVD inside of a guest is worthless.  Changing sleep
state in a virtual machine doesn't affect the host system sleep state
and cannot lead to data loss, so most hypervisors simply ignore it.
Despite this, the ACPI code calls WBINVD unconditionally anyway.  It's
useless, but also normally harmless.

In TDX guests, though, WBINVD stops being harmless; it triggers a
virtualization exception (#VE).  If the ACPI cache-flushing WBINVD were
left in place, TDX guests would need handling to recover from the exception.

Avoid using WBINVD whenever running under a hypervisor.  This both
removes the useless WBINVDs and saves TDX from implementing WBINVD handling.

> diff --git a/arch/x86/include/asm/acenv.h b/arch/x86/include/asm/acenv.h
> index 9aff97f0de7f..d937c55e717e 100644
> --- a/arch/x86/include/asm/acenv.h
> +++ b/arch/x86/include/asm/acenv.h
> @@ -13,7 +13,19 @@
>  
>  /* Asm macros */
>  
> -#define ACPI_FLUSH_CPU_CACHE()	wbinvd()
> +/*
> + * ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states.
> + * It is required to prevent data loss.
> + *
> + * While running inside virtual machine, the kernel can bypass cache flushing.
> + * Changing sleep state in a virtual machine doesn't affect the host system
> + * sleep state and cannot lead to data loss.
> + */
> +#define ACPI_FLUSH_CPU_CACHE()					\
> +do {								\
> +	if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))	\
> +		wbinvd();					\
> +} while (0)
>  
>  int __acpi_acquire_global_lock(unsigned int *lock);
>  int __acpi_release_global_lock(unsigned int *lock);


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 30/30] Documentation/x86: Document TDX kernel architecture
  2022-03-02 14:28 ` [PATCHv5 30/30] Documentation/x86: Document TDX kernel architecture Kirill A. Shutemov
@ 2022-03-09 21:49   ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-09 21:49 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On 3/2/22 06:28, Kirill A. Shutemov wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> Document the TDX guest architecture details like #VE support,
> shared memory, etc.
> 
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCHv5.2 04/30] x86/tdx: Extend the confidential computing API to support TDX guests
  2022-03-09 18:36       ` Dave Hansen
@ 2022-03-09 23:51         ` Kirill A. Shutemov
  2022-03-10  0:07           ` Dave Hansen
  2022-03-15 19:41           ` Borislav Petkov
  0 siblings, 2 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-09 23:51 UTC (permalink / raw)
  To: dave.hansen
  Cc: aarcange, ak, bp, brijesh.singh, dan.j.williams, david, hpa,
	jgross, jmattson, joro, jpoimboe, kirill.shutemov, knsathya,
	linux-kernel, luto, mingo, pbonzini, peterz,
	sathyanarayanan.kuppuswamy, sdeep, seanjc, tglx, thomas.lendacky,
	tony.luck, vkuznets, wanpengli, x86

Confidential Computing (CC) features (like string I/O unroll support,
memory encryption/decryption support, etc) are conditionally enabled
in the kernel using cc_platform_has() API. Since TDX guests also need
to use these CC features, extend cc_platform_has() API and add TDX
guest-specific CC attributes support.

CC API also provides an interface to deal with encryption mask. Extend
it to cover TDX.

Which bit in the page table entry is used to indicate the
shared/private state is determined by using the TDINFO TDCALL.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 v5.2:
  - Update comment for tdx_module_call() definition and for the
    TDX_GET_INFO call site.
---
 arch/x86/Kconfig     |  1 +
 arch/x86/coco/core.c | 12 ++++++++++++
 arch/x86/coco/tdx.c  | 46 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 59 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c346d66b51fc..93e67842e369 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
 	bool "Intel TDX (Trust Domain Extensions) - Guest Support"
 	depends on X86_64 && CPU_SUP_INTEL
 	depends on X86_X2APIC
+	select ARCH_HAS_CC_PLATFORM
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index fc1365dd927e..6529db059938 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -87,9 +87,18 @@ EXPORT_SYMBOL_GPL(cc_platform_has);
 
 u64 cc_mkenc(u64 val)
 {
+	/*
+	 * Both AMD and Intel use a bit in the page table to indicate encryption
+	 * status of the page.
+	 *
+	 * - for AMD, bit *set* means the page is encrypted
+	 * - for Intel *clear* means encrypted.
+	 */
 	switch (vendor) {
 	case CC_VENDOR_AMD:
 		return val | cc_mask;
+	case CC_VENDOR_INTEL:
+		return val & ~cc_mask;
 	default:
 		return val;
 	}
@@ -97,9 +106,12 @@ u64 cc_mkenc(u64 val)
 
 u64 cc_mkdec(u64 val)
 {
+	/* See comment in cc_mkenc() */
 	switch (vendor) {
 	case CC_VENDOR_AMD:
 		return val & ~cc_mask;
+	case CC_VENDOR_INTEL:
+		return val | cc_mask;
 	default:
 		return val;
 	}
diff --git a/arch/x86/coco/tdx.c b/arch/x86/coco/tdx.c
index d1ce35c1ac18..b74b3f70f584 100644
--- a/arch/x86/coco/tdx.c
+++ b/arch/x86/coco/tdx.c
@@ -5,8 +5,12 @@
 #define pr_fmt(fmt)     "tdx: " fmt
 
 #include <linux/cpufeature.h>
+#include <asm/coco.h>
 #include <asm/tdx.h>
 
+/* TDX module Call Leaf IDs */
+#define TDX_GET_INFO			1
+
 /*
  * Wrapper for standard use of __tdx_hypercall with no output aside from
  * return code.
@@ -25,8 +29,40 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
 	return __tdx_hypercall(&args, 0);
 }
 
+/*
+ * Used for TDX guests to make calls directly to the TD module.  This
+ * should only be used for calls that have no legitimate reason to fail
+ * or where the kernel can not survive the call failing.
+ */
+static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+				   struct tdx_module_output *out)
+{
+	if (__tdx_module_call(fn, rcx, rdx, r8, r9, out))
+		panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
+}
+
+static void get_info(unsigned int *gpa_width)
+{
+	struct tdx_module_output out;
+
+	/*
+	 * TDINFO TDX module call is used to get the TD execution environment
+	 * information like GPA width, number of available vcpus, debug mode
+	 * information, etc. More details about the ABI can be found in TDX
+	 * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
+	 * [TDG.VP.INFO].
+	 *
+	 * The GPA width that comes out of this call is critical. TDX guests
+	 * can not meaningfully run without it.
+	 */
+	tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
+
+	*gpa_width = out.rcx & GENMASK(5, 0);
+}
+
 void __init tdx_early_init(void)
 {
+	unsigned int gpa_width;
 	u32 eax, sig[3];
 
 	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
@@ -36,5 +72,15 @@ void __init tdx_early_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+	get_info(&gpa_width);
+
+	cc_set_vendor(CC_VENDOR_INTEL);
+
+	/*
+	 * The highest bit of a guest physical address is the "sharing" bit.
+	 * Set it for shared pages and clear it for private pages.
+	 */
+	cc_set_mask(BIT_ULL(gpa_width - 1));
+
 	pr_info("Guest detected\n");
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCHv5.2 04/30] x86/tdx: Extend the confidential computing API to support TDX guests
  2022-03-09 23:51         ` [PATCHv5.2 " Kirill A. Shutemov
@ 2022-03-10  0:07           ` Dave Hansen
  2022-03-15 19:41           ` Borislav Petkov
  1 sibling, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-10  0:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: aarcange, ak, bp, brijesh.singh, dan.j.williams, david, hpa,
	jgross, jmattson, joro, jpoimboe, knsathya, linux-kernel, luto,
	mingo, pbonzini, peterz, sathyanarayanan.kuppuswamy, sdeep,
	seanjc, tglx, thomas.lendacky, tony.luck, vkuznets, wanpengli,
	x86

On 3/9/22 15:51, Kirill A. Shutemov wrote:
> Confidential Computing (CC) features (like string I/O unroll support,
> memory encryption/decryption support, etc) are conditionally enabled
> in the kernel using cc_platform_has() API. Since TDX guests also need
> to use these CC features, extend cc_platform_has() API and add TDX
> guest-specific CC attributes support.
> 
> CC API also provides an interface to deal with encryption mask. Extend
> it to cover TDX.
> 
> Which bit in the page table entry is used to indicate the
> shared/private state is determined by using the TDINFO TDCALL.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  v5.2:
>   - Update comment for tdx_module_call() definition and for the
>     TDX_GET_INFO call site.

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO
  2022-03-08 21:26   ` Dave Hansen
@ 2022-03-10  0:51     ` Kirill A. Shutemov
  2022-03-10  1:06       ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-10  0:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: tglx, mingo, bp, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, thomas.lendacky, brijesh.singh, x86, linux-kernel

On Tue, Mar 08, 2022 at 01:26:28PM -0800, Dave Hansen wrote:
> On 3/2/22 06:27, Kirill A. Shutemov wrote:
> > In non-TDX VMs, MMIO is implemented by providing the guest a mapping
> > which will cause a VMEXIT on access and then the VMM emulating the
> > instruction that caused the VMEXIT. That's not possible for TDX VM.
> > 
> > To emulate an instruction an emulator needs two things:
> > 
> >   - R/W access to the register file to read/modify instruction arguments
> >     and see RIP of the faulted instruction.
> > 
> >   - Read access to memory where instruction is placed to see what to
> >     emulate. In this case it is guest kernel text.
> > 
> > Both of them are not available to VMM in TDX environment:
> > 
> >   - Register file is never exposed to VMM. When a TD exits to the module,
> >     it saves registers into the state-save area allocated for that TD.
> >     The module then scrubs these registers before returning execution
> >     control to the VMM, to help prevent leakage of TD state.
> > 
> >   - Memory is encrypted a TD-private key. The CPU disallows software
> >     other than the TDX module and TDs from making memory accesses using
> >     the private key.
> 
> Memory encryption has zero to do with this.  The TDX isolation
> mechanisms are totally discrete from memory encryption, although they
> are "neighbors" of sorts.

Hm. I don't see why you say encryption is not relevant. The VMM (host
kernel) has ultimate access to the guest memory ciphertext. It can read
it as ciphertext without any issue (using KeyID-0).

Could you elaborate on the point?

> > == Patching TDX drivers ==
> > 
> > Rather than touching the entire kernel, it might also be possible to
> > just go after drivers that use MMIO in TDX guests.  Right now, that's
> > limited only to virtio and some x86-specific drivers.
> > 
> > All virtio MMIO appears to be done through a single function, which
> > makes virtio eminently easy to patch.
> > 
> > This approach will be adopted in the future, removing the bulk of
> > MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
> 
> This still doesn't *quite* do it for me for a justification.  Why can't
> the non-virtio cases be converted as well?  Why doesn't the "patching
> MMIO sites" work for x86 code too?
> 
> You really need to convince us that *this* approach will be required
> forever.

What if I add:

	Many drivers can potentially be used inside a TDX guest (e.g. via device
	passthrough or random device emulation by the VMM), but very few will.
	Patching every possible driver is not practical. #VE-based MMIO provides
	functionality for everybody. Performance-critical cases can be optimized
	as needed.

?

> > +static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> > +{
> > +	char buffer[MAX_INSN_SIZE];
> > +	unsigned long *reg, val;
> > +	struct insn insn = {};
> > +	enum mmio_type mmio;
> > +	int size, extend_size;
> > +	u8 extend_val = 0;
> > +
> > +	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
> > +		return false;
> > +
> > +	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
> > +		return false;
> > +
> > +	mmio = insn_decode_mmio(&insn, &size);
> > +	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
> > +		return false;
> > +
> > +	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
> > +		reg = insn_get_modrm_reg_ptr(&insn, regs);
> > +		if (!reg)
> > +			return false;
> > +	}
> > +
> > +	ve->instr_len = insn.length;
> > +
> > +	switch (mmio) {
> > +	case MMIO_WRITE:
> > +		memcpy(&val, reg, size);
> > +		return mmio_write(size, ve->gpa, val);
> > +	case MMIO_WRITE_IMM:
> > +		val = insn.immediate.value;
> > +		return mmio_write(size, ve->gpa, val);
> > +	case MMIO_READ:
> > +	case MMIO_READ_ZERO_EXTEND:
> > +	case MMIO_READ_SIGN_EXTEND:
> > +		break;
> > +	case MMIO_MOVS:
> > +	case MMIO_DECODE_FAILED:
> > +		/*
> > +		 * MMIO was accessed with an instruction that could not be
> > +		 * decoded or handled properly. It was likely not using io.h
> > +		 * helpers or accessed MMIO accidentally.
> > +		 */
> > +		return false;
> > +	default:
> > +		/* Unknown insn_decode_mmio() decode value? */
> > +		BUG();
> > +	}
> 
> BUG()s are bad.  The set of insn_decode_mmio() return codes is known at
> compile time.  If we're really on the lookout for unknown values, why
> not just:
> 
> 	BUILD_BUG_ON(NR_MMIO_TYPES != 6); // or whatever

This doesn't work.

We can pretend that the function is only allowed to return values from
the enum. The truth is that it can return whatever int it wants. The
type system in C is too weak to guarantee anything here. The BUG() is a
backstop for it.

This BUILD_BUG_ON() is useless. Compiler complains about missing case in
the switch anyway.

> Also, there are *lots* of ways for this function to just fall over and
> fail.  Why does this particular failure mode deserve a BUG()?
> 
> Is there a reason a BUG() is better than returning failure which
> presumably sets off the #GP-like logic?

BUG() here makes it clear that the handler itself is buggy. Returning
false and kicking in the #GP-like logic indicates that something is wrong with
the code that triggered #VE. I think it is an important distinction.

> Also, now that I've read this a few times, I've been confused by the
> same thing a few times.  This is handling instructions that might read
> or write or do both, correct?
> 
> Should that be made explicit in a function comment?

Hm. Okay. Something like

/* Handle reads from and writes to MMIO region. */

before the function?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO
  2022-03-10  0:51     ` Kirill A. Shutemov
@ 2022-03-10  1:06       ` Dave Hansen
  2022-03-10 16:48         ` Kirill A. Shutemov
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2022-03-10  1:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, bp, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, thomas.lendacky, brijesh.singh, x86, linux-kernel

On 3/9/22 16:51, Kirill A. Shutemov wrote:
> On Tue, Mar 08, 2022 at 01:26:28PM -0800, Dave Hansen wrote:
>> Memory encryption has zero to do with this.  The TDX isolation
>> mechanisms are totally discrete from memory encryption, although they
>> are "neighbors" of sorts.
> 
> Hm. I don't see why you say encryption is not relevant. The VMM (host
> kernel) has ultimate access to the guest memory ciphertext. It can read
> it as ciphertext without any issue (using KeyID-0).
> 
> Could you elaborate on the point?

I think you're just confusing what TDX has with MKTME.  The whitepaper says:

> The TD-bit associated with the line in memory seeks to
> detect software or devices attempting to read memory
> encrypted with private KeyID, using a shared KeyID, to reveal
> the ciphertext. On such accesses, the MKTME returns a fixed
> pattern to prevent ciphertext analysis.

I think several firstborn were sacrificed to get that bit.  Let's not
forget why we have it. :)

>>> Rather than touching the entire kernel, it might also be possible to
>>> just go after drivers that use MMIO in TDX guests.  Right now, that's
>>> limited only to virtio and some x86-specific drivers.
>>>
>>> All virtio MMIO appears to be done through a single function, which
>>> makes virtio eminently easy to patch.
>>>
>>> This approach will be adopted in the future, removing the bulk of
>>> MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
>>
>> This still doesn't *quite* do it for me for a justification.  Why can't
>> the non-virtio cases be converted as well?  Why doesn't the "patching
>> MMIO sites" work for x86 code too?
>>
>> You really need to convince us that *this* approach will be required
>> forever.
> 
> What if I add:
> 
> 	Many drivers can potentially be used inside a TDX guest (e.g. via device
> 	passthrough or random device emulation by the VMM), but very few will.
> 	Patching every possible driver is not practical. #VE-based MMIO provides
> 	functionality for everybody. Performance-critical cases can be optimized
> 	as needed.

This problem was laid out as having three cases:
1. virtio
2. x86-specific drivers
3. random drivers (everything else)

#1 could be done with paravirt
#2 is unspecified and unknown
#3 doesn't, as far as I know, exist in TDX guests today

>>> +static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
>>> +{
>>> +	char buffer[MAX_INSN_SIZE];
>>> +	unsigned long *reg, val;
>>> +	struct insn insn = {};
>>> +	enum mmio_type mmio;
>>> +	int size, extend_size;
>>> +	u8 extend_val = 0;
>>> +
>>> +	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
>>> +		return false;
>>> +
>>> +	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
>>> +		return false;
>>> +
>>> +	mmio = insn_decode_mmio(&insn, &size);
>>> +	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
>>> +		return false;
>>> +
>>> +	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
>>> +		reg = insn_get_modrm_reg_ptr(&insn, regs);
>>> +		if (!reg)
>>> +			return false;
>>> +	}
>>> +
>>> +	ve->instr_len = insn.length;
>>> +
>>> +	switch (mmio) {
>>> +	case MMIO_WRITE:
>>> +		memcpy(&val, reg, size);
>>> +		return mmio_write(size, ve->gpa, val);
>>> +	case MMIO_WRITE_IMM:
>>> +		val = insn.immediate.value;
>>> +		return mmio_write(size, ve->gpa, val);
>>> +	case MMIO_READ:
>>> +	case MMIO_READ_ZERO_EXTEND:
>>> +	case MMIO_READ_SIGN_EXTEND:
>>> +		break;
>>> +	case MMIO_MOVS:
>>> +	case MMIO_DECODE_FAILED:
>>> +		/*
>>> +		 * MMIO was accessed with an instruction that could not be
>>> +		 * decoded or handled properly. It was likely not using io.h
>>> +		 * helpers or accessed MMIO accidentally.
>>> +		 */
>>> +		return false;
>>> +	default:
>>> +		/* Unknown insn_decode_mmio() decode value? */
>>> +		BUG();
>>> +	}
>>
>> BUG()s are bad.  The set of insn_decode_mmio() return codes is known at
>> compile time.  If we're really on the lookout for unknown values, why
>> not just:
>>
>> 	BUILD_BUG_ON(NR_MMIO_TYPES != 6); // or whatever
> 
> This doesn't work.
> 
> We can pretend that the function is only allowed to return values from
> the enum. The truth is that it can return whatever int it wants. The
> type system in C is too weak to guarantee anything here. The BUG() is a
> backstop for it.
> 
> This BUILD_BUG_ON() is useless. Compiler complains about missing case in
> the switch anyway.
> 
>> Also, there are *lots* of ways for this function to just fall over and
>> fail.  Why does this particular failure mode deserve a BUG()?
>>
>> Is there a reason a BUG() is better than returning failure which
>> presumably sets off the #GP-like logic?
> 
> BUG() here makes it clear that the handler itself is buggy. Returning
> false and kicking in the #GP-like logic indicates that something is wrong with
> the code that triggered #VE. I think it is an important distinction.

OK, then how about a WARN_ON() which is followed by the #GP?

Let's say insn_decode_mmio() does something insane like:

	return -EINVAL;

Should we really be killing the kernel for that?
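
A warning plus the normal failure path seems like plenty.  Roughly (a
sketch, untested):

	default:
		/* Unknown insn_decode_mmio() decode value? */
		WARN_ONCE(1, "Unknown MMIO decode type: %d\n", mmio);
		return false;

That way the warning fingers the buggy handler itself, and the #GP-like
logic still deals with whatever triggered the #VE.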

>> Also, now that I've read this a few times, I've been confused by the
>> same thing a few times.  This is handling instructions that might read
>> or write or do both, correct?
>>
>> Should that be made explicit in a function comment?
> 
> Hm. Okay. Something like
> 
> /* Handle reads from and writes to MMIO region. */
> 
> before the function?

There are really two halves to the function.  It would be nice to be
explicit about how the function is laid out and why reads take a bit
more handling at the end than writes.

That might be obvious to someone who has written an emulator or two but
it would be nice to include for the uninitiated.
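
Maybe something along these lines above the function (a rough sketch of
the structure; adjust the wording to taste):

	/*
	 * handle_mmio() comes in two halves:
	 *
	 *  1. Decode the instruction.  Writes finish here: the value
	 *     is simply handed to the VMM.
	 *  2. Reads need a second step: the value coming back from the
	 *     VMM must be zero- or sign-extended to the operand size
	 *     and copied into the destination register.
	 */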

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 02/30] x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers
  2022-03-02 14:27 ` [PATCHv5 02/30] x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers Kirill A. Shutemov
  2022-03-08 19:56   ` Dave Hansen
@ 2022-03-10 12:32   ` Borislav Petkov
  2022-03-10 14:44     ` Kirill A. Shutemov
  1 sibling, 1 reply; 84+ messages in thread
From: Borislav Petkov @ 2022-03-10 12:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On Wed, Mar 02, 2022 at 05:27:38PM +0300, Kirill A. Shutemov wrote:
>  arch/x86/include/asm/tdx.h    | 28 +++++++++++
>  arch/x86/kernel/asm-offsets.c |  9 ++++
>  arch/x86/virt/tdxcall.S       | 95 +++++++++++++++++++++++++++++++++++

Right, you asked already about putting this under arch/x86/virt/ but on
a second thought, this doesn't look like

"- generic host virtualization stuff: arch/x86/virt/"

to me:

https://lore.kernel.org/r/Yg5nh1RknPRwIrb8@zn.tnic

Rather, this looks like it wants to be under

 arch/x86/virt/vmx/tdx/

where we said that this should be the coco host code place.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 27/30] x86/kvm: Use bounce buffers for TD guest
  2022-03-09 20:07   ` Dave Hansen
@ 2022-03-10 14:29     ` Tom Lendacky
  2022-03-10 14:51       ` Christoph Hellwig
  0 siblings, 1 reply; 84+ messages in thread
From: Tom Lendacky @ 2022-03-10 14:29 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, brijesh.singh, x86,
	linux-kernel

On 3/9/22 14:07, Dave Hansen wrote:
> On 3/2/22 06:28, Kirill A. Shutemov wrote:
>> --- a/arch/x86/coco/tdx.c
>> +++ b/arch/x86/coco/tdx.c
>> @@ -5,6 +5,7 @@
>>   #define pr_fmt(fmt)     "tdx: " fmt
>>   
>>   #include <linux/cpufeature.h>
>> +#include <linux/swiotlb.h>
>>   #include <asm/coco.h>
>>   #include <asm/tdx.h>
>>   #include <asm/vmx.h>
>> @@ -627,5 +628,7 @@ void __init tdx_early_init(void)
>>   	x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
>>   	x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
>>   
>> +	swiotlb_force = SWIOTLB_FORCE;
>> +
>>   	pr_info("Guest detected\n");
>>   }
> 
> AMD currently does:
> 
>          if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
>                  swiotlb_force = SWIOTLB_FORCE;
> 
> which somewhat begs the question of why we can't do the
> 
> 	swiotlb_force = SWIOTLB_FORCE;
> 
> thing in:
> 
> void __init mem_encrypt_init(void)
> {
>          if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))

If you make this cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT), then it 
should work for both, I would think. If you use CC_ATTR_MEM_ENCRYPT, 
you'll force bare-metal SME to always use bounce buffers when doing I/O. 
But SME can do I/O to encrypted memory if the device supports 64-bit DMA 
or if the IOMMU is being used, so we don't want to force SWIOTLB in this case.
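
I.e., something like this (untested, and modulo Dave's question about 
why it wasn't done there in the first place):

	void __init mem_encrypt_init(void)
	{
		if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))
			return;

		/* Force bounce buffers for guests, but not bare-metal SME: */
		if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
			swiotlb_force = SWIOTLB_FORCE;
		...
	}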

Thanks,
Tom

>                  return;
> 
> /// Here
> 
> I recall there being a reason for this.  But I don't see any mention in
> the changelog.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 02/30] x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers
  2022-03-10 12:32   ` Borislav Petkov
@ 2022-03-10 14:44     ` Kirill A. Shutemov
  2022-03-10 14:51       ` Borislav Petkov
  0 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-10 14:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On Thu, Mar 10, 2022 at 01:32:40PM +0100, Borislav Petkov wrote:
> On Wed, Mar 02, 2022 at 05:27:38PM +0300, Kirill A. Shutemov wrote:
> >  arch/x86/include/asm/tdx.h    | 28 +++++++++++
> >  arch/x86/kernel/asm-offsets.c |  9 ++++
> >  arch/x86/virt/tdxcall.S       | 95 +++++++++++++++++++++++++++++++++++
> 
> Right, you asked already about putting this under arch/x86/virt/ but on
> a second thought, this doesn't look like
> 
> "- generic host virtualization stuff: arch/x86/virt/"
> 
> to me:
> 
> https://lore.kernel.org/r/Yg5nh1RknPRwIrb8@zn.tnic
> 
> Rather, this looks like it wants to be under
> 
>  arch/x86/virt/vmx/tdx/
> 
> where we said that this should be the coco host code place.

I'm fine with moving it wherever you want. But I want to make sure we are
on the same page: this code is common to guest and host TDX. I think VMX
refers more to the host side of things, no?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 02/30] x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers
  2022-03-10 14:44     ` Kirill A. Shutemov
@ 2022-03-10 14:51       ` Borislav Petkov
  0 siblings, 0 replies; 84+ messages in thread
From: Borislav Petkov @ 2022-03-10 14:51 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On Thu, Mar 10, 2022 at 05:44:04PM +0300, Kirill A. Shutemov wrote:
> I'm fine with moving it wherever you want. But I want to make sure we are
> on the same page: this code is common to guest and host TDX. I think VMX
> refers more to the host side of things, no?

Well, that patch has host-side stuff too.

If we have to be pedantic, this should be in

arch/x86/virt/shared/vmx/tdx

or so but that's bikeshedding gone out of control to me.

And it isn't generic, as pointed out earlier, so arch/x86/virt/ does not
fit either.

So I'd think of something with "tdx" in the pathname and
arch/x86/virt/vmx/tdx/ is kinda the only one we agreed upon as a path...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 27/30] x86/kvm: Use bounce buffers for TD guest
  2022-03-10 14:29     ` Tom Lendacky
@ 2022-03-10 14:51       ` Christoph Hellwig
  0 siblings, 0 replies; 84+ messages in thread
From: Christoph Hellwig @ 2022-03-10 14:51 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Dave Hansen, Kirill A. Shutemov, tglx, mingo, bp, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, brijesh.singh, x86,
	linux-kernel

On Thu, Mar 10, 2022 at 08:29:01AM -0600, Tom Lendacky wrote:
> > void __init mem_encrypt_init(void)
> > {
> >          if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> 
> If you make this cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT), then it should
> work for both, I would think. If you use CC_ATTR_MEM_ENCRYPT, you'll force
> bare-metal SME to always use bounce buffers when doing I/O. But SME can do
> I/O to encrypted memory if the device supports 64-bit DMA or if the IOMMU is
> being used, so we don't want to force SWIOTLB in this case.

http://git.infradead.org/users/hch/misc.git/commitdiff/18b0547fe0467cb48e64ee403f50f2587fe04e3a

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 03/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-03-02 14:27 ` [PATCHv5 03/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions Kirill A. Shutemov
  2022-03-08 20:03   ` Dave Hansen
@ 2022-03-10 15:30   ` Borislav Petkov
  2022-03-10 21:20     ` Kirill A. Shutemov
  1 sibling, 1 reply; 84+ messages in thread
From: Borislav Petkov @ 2022-03-10 15:30 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On Wed, Mar 02, 2022 at 05:27:39PM +0300, Kirill A. Shutemov wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> Guests communicate with VMMs with hypercalls. Historically, these
> are implemented using instructions that are known to cause VMEXITs
> like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
> expose the guest state to the host. This prevents the old hypercall
> mechanisms from working. So, to communicate with VMM, TDX
> specification defines a new instruction called TDCALL.
> 
> In a TDX based VM, since the VMM is an untrusted entity, an intermediary
> layer -- TDX module -- facilitates secure communication between the host
> and the guest. TDX module is loaded like a firmware into a special CPU
> mode called SEAM. TDX guests communicate with the TDX module using the
> TDCALL instruction.
> 
> A guest uses TDCALL to communicate with both the TDX module and VMM.
> The value of the RAX register when executing the TDCALL instruction is
> used to determine the TDCALL type. A variant of TDCALL used to communicate
> with the VMM is called TDVMCALL.
> 
> Add generic interfaces to communicate with the TDX module and VMM
> (using the TDCALL instruction).
> 
> __tdx_hypercall()    - Used by the guest to request services from the
> 		       VMM (via TDVMCALL).
> __tdx_module_call()  - Used to communicate with the TDX module (via
> 		       TDCALL).

Ok, you need to fix this: this sounds to me like there are two insns:
TDCALL and TDVMCALL. But there's only TDCALL.

And I'm not even clear on how the differentiation is done - I guess
with %r11 which contains the VMCALL subfunction number in the
__tdx_hypercall() case but I'm not sure.

And when explaining this, pls put it in the comment over the function so
that it is clear how the distinction is made.

> Also define an additional wrapper _tdx_hypercall(), which adds error
> handling support for the TDCALL failure.
> 
> The __tdx_module_call() and __tdx_hypercall() helper functions are
> implemented in assembly in a .S file.  The TDCALL ABI requires
> shuffling arguments in and out of registers, which proved to be
> awkward with inline assembly.
> 
> Just like syscalls, not all TDVMCALL use cases need to use the same
> number of argument registers. The implementation here picks the current
> worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
> than 4 arguments, there will end up being a few superfluous (cheap)
> instructions. But, this approach maximizes code reuse.
> 
> For registers used by the TDCALL instruction, please check TDX GHCI
> specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
> Interface".
> 
> Based on previous patch by Sean Christopherson.
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/coco/Makefile        |   2 +-
>  arch/x86/coco/tdcall.S        | 188 ++++++++++++++++++++++++++++++++++
>  arch/x86/coco/tdx.c           |  18 ++++

Those should be

arch/x86/coco/tdx/tdcall.S
arch/x86/coco/tdx/tdx.c

like we said:

"- confidential computing guest stuff: arch/x86/coco/{sev,tdx}"

>  arch/x86/include/asm/tdx.h    |  27 +++++
>  arch/x86/kernel/asm-offsets.c |  10 ++
>  5 files changed, 244 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/coco/tdcall.S

...

> +SYM_FUNC_START(__tdx_hypercall)
> +	FRAME_BEGIN
> +
> +	/* Save callee-saved GPRs as mandated by the x86_64 ABI */
> +	push %r15
> +	push %r14
> +	push %r13
> +	push %r12
> +
> +	/* Mangle function call ABI into TDCALL ABI: */
> +	/* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
> +	xor %eax, %eax
> +
> +	/* Copy hypercall registers from arg struct: */
> +	movq TDX_HYPERCALL_r10(%rdi), %r10
> +	movq TDX_HYPERCALL_r11(%rdi), %r11
> +	movq TDX_HYPERCALL_r12(%rdi), %r12
> +	movq TDX_HYPERCALL_r13(%rdi), %r13
> +	movq TDX_HYPERCALL_r14(%rdi), %r14
> +	movq TDX_HYPERCALL_r15(%rdi), %r15
> +
> +	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
> +
> +	tdcall
> +
> +	/*
> +	 * RAX!=0 indicates a failure of the TDVMCALL mechanism itself and that
> +	 * something has gone horribly wrong with the TDX module.
> +	 *
> +	 * The return status of the hypercall operation is in a separate
> +	 * register (in R10). Hypercall errors are a part of normal operation
> +	 * and are handled by callers.
> +	 */
> +	testq %rax, %rax
> +	jne .Lpanic

Hm, can this call a C function which does the panic so that a proper
error message is dumped to the user so that at least she knows where the
panic comes from?
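
I.e., something like (a sketch; the helper name is made up):

	.Lpanic:
		call tdx_hypercall_panic
		ud2

with a trivial C helper somewhere:

	void __noreturn tdx_hypercall_panic(void)
	{
		panic("TDVMCALL failed. TDX module bug?");
	}

so the user at least gets a sensible message instead of a bare ud2
oops.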

> +
> +	/* TDVMCALL leaf return code is in R10 */
> +	movq %r10, %rax
> +
> +	/* Copy hypercall result registers to arg struct if needed */
> +	testq $TDX_HCALL_HAS_OUTPUT, %rsi
> +	jz .Lout
> +
> +	movq %r10, TDX_HYPERCALL_r10(%rdi)
> +	movq %r11, TDX_HYPERCALL_r11(%rdi)
> +	movq %r12, TDX_HYPERCALL_r12(%rdi)
> +	movq %r13, TDX_HYPERCALL_r13(%rdi)
> +	movq %r14, TDX_HYPERCALL_r14(%rdi)
> +	movq %r15, TDX_HYPERCALL_r15(%rdi)
> +.Lout:
> +	/*
> +	 * Zero out registers exposed to the VMM to avoid speculative execution
> +	 * with VMM-controlled values. This needs to include all registers
> +	 * present in TDVMCALL_EXPOSE_REGS_MASK (except R12-R15). R12-R15
> +	 * context will be restored.
> +	 */
> +	xor %r10d, %r10d
> +	xor %r11d, %r11d
> +
> +	/* Restore callee-saved GPRs as mandated by the x86_64 ABI */
> +	pop %r12
> +	pop %r13
> +	pop %r14
> +	pop %r15
> +
> +	FRAME_END
> +
> +	retq
> +.Lpanic:
> +	ud2
> +SYM_FUNC_END(__tdx_hypercall)

...

> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> index 7dca52f5cfc6..0b465e7d0a2f 100644
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -74,6 +74,16 @@ static void __used common(void)
>  	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
>  	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST

Those have ifdeffery around them - why don't the TDX_MODULE_* ones need
it too?

> +	BLANK();
> +	OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_args, r10);
> +	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_args, r11);
> +	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_args, r12);
> +	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_args, r13);
> +	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_args, r14);
> +	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_args, r15);
> +#endif
> +

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO
  2022-03-10  1:06       ` Dave Hansen
@ 2022-03-10 16:48         ` Kirill A. Shutemov
  2022-03-10 17:53           ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-10 16:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: tglx, mingo, bp, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, thomas.lendacky, brijesh.singh, x86, linux-kernel

On Wed, Mar 09, 2022 at 05:06:11PM -0800, Dave Hansen wrote:
> On 3/9/22 16:51, Kirill A. Shutemov wrote:
> > On Tue, Mar 08, 2022 at 01:26:28PM -0800, Dave Hansen wrote:
> >> Memory encryption has zero to do with this.  The TDX isolation
> >> mechanisms are totally discrete from memory encryption, although they
> >> are "neighbors" of sorts.
> > 
> > Hm. I don't see why you say encryption is not relevant. VMM (host kernel)
> > has ultimate access to guest memory cypher text. It can read it as
> > cypher text without any issue (using KeyID-0).
> > 
> > Could you elaborate on the point?
> 
> I think you're just confusing what TDX has with MKTME.  The whitepaper says:
> 
> > The TD-bit associated with the line in memory seeks to
> > detect software or devices attempting to read memory
> > encrypted with private KeyID, using a shared KeyID, to reveal
> > the ciphertext. On such accesses, the MKTME returns a fixed
> > pattern to prevent ciphertext analysis.
> 
> I think several firstborn were sacrificed to get that bit.  Let's not
> forget why we have it. :)

Okay, I missed the memo. I will drop the reference to encryption:

  - The CPU disallows software other than the TDX module and TDs from
    making memory accesses using the private key. Without the correct
    key, the VMM has no way to access TD-private memory.

> >>> Rather than touching the entire kernel, it might also be possible to
> >>> just go after drivers that use MMIO in TDX guests.  Right now, that's
> >>> limited only to virtio and some x86-specific drivers.
> >>>
> >>> All virtio MMIO appears to be done through a single function, which
> >>> makes virtio eminently easy to patch.
> >>>
> >>> This approach will be adopted in the future, removing the bulk of
> >>> MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
> >>
> >> This still doesn't *quite* do it for me for a justification.  Why can't
> >> the non-virtio cases be converted as well?  Why doesn't the "patching
> >> MMIO sites" work for x86 code too?
> >>
> >> You really need to convince us that *this* approach will be required
> >> forever.
> > 
> > What if I add:
> > 
> > 	Many drivers can potentially be used inside a TDX guest (e.g. via
> > 	device passthrough or random device emulation by the VMM), but very
> > 	few will be. Patching every possible driver is not practical.
> > 	#VE-based MMIO provides functionality for everybody.
> > 	Performance-critical cases can be optimized as needed.
> 
> This problem was laid out as having three cases:
> 1. virtio
> 2. x86-specific drivers
> 3. random drivers (everything else)
> 
> #1 could be done with paravirt
> #2 is unspecified and unknown
> #3 use doesn't, as far as I know, exist in TDX guests today

#2 doesn't matter from a performance point of view and there is no
convenient place where they can be intercepted, as they are scattered
across the tree. Patching them doesn't bring any benefit, only pain.

#3 some customers have already declared that they will use device
passthrough (yes, it is not safe). A CSP may want to emulate a random
device, depending on the setup. Like a power button or something.

> > BUG() here makes it clear that the handler itself is buggy. Returning
> > false and kicking in the #GP-like logic indicates that something is wrong
> > with the code that triggered the #VE. I think it is an important distinction.
> 
> OK, then how about a WARN_ON() which is followed by the #GP?

You folks give mixed messages. Thomas was very unhappy when I tried to add
code that recovers from WBINVD:

https://lore.kernel.org/all/87y22uujkm.ffs@tglx

It is exactly the same scenario: kernel code is buggy and has to be fixed.

So, what is the policy?

> Let's say insn_decode_mmio() does something insane like:
> 
> 	return -EINVAL;
> 
> Should we really be killing the kernel for that?

Note that a #GP most likely kills the kernel as well. We handle in-kernel
MMIO; there are not many chances to recover.

Is it really that big a deal?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO
  2022-03-10 16:48         ` Kirill A. Shutemov
@ 2022-03-10 17:53           ` Dave Hansen
  2022-03-11 17:18             ` Kirill A. Shutemov
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2022-03-10 17:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, bp, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, thomas.lendacky, brijesh.singh, x86, linux-kernel

On 3/10/22 08:48, Kirill A. Shutemov wrote:
> On Wed, Mar 09, 2022 at 05:06:11PM -0800, Dave Hansen wrote:
>> On 3/9/22 16:51, Kirill A. Shutemov wrote:
>>> On Tue, Mar 08, 2022 at 01:26:28PM -0800, Dave Hansen wrote:
>>>> Memory encryption has zero to do with this.  The TDX isolation
>>>> mechanisms are totally discrete from memory encryption, although they
>>>> are "neighbors" of sorts.
>>>
>>> Hm. I don't see why you say encryption is not relevant. The VMM (host
>>> kernel) has ultimate access to the guest memory ciphertext. It can read
>>> it as ciphertext without any issue (using KeyID-0).
>>>
>>> Could you elaborate on the point?
>>
>> I think you're just confusing what TDX has with MKTME.  The whitepaper says:
>>
>>> The TD-bit associated with the line in memory seeks to
>>> detect software or devices attempting to read memory
>>> encrypted with private KeyID, using a shared KeyID, to reveal
>>> the ciphertext. On such accesses, the MKTME returns a fixed
>>> pattern to prevent ciphertext analysis.
>>
>> I think several firstborn were sacrificed to get that bit.  Let's not
>> forget why we have it. :)
> 
> Okay, I missed the memo. I will drop the reference to encryption:
> 
>   - The CPU disallows software other than the TDX module and TDs from
>     making memory accesses using the private key. Without the correct
>     key, the VMM has no way to access TD-private memory.

I think this is good enough:

   - All guest code is expected to be in TD-private memory.  Being
     private to the TD, VMMs have no way to access TD-private memory and
     no way to read the instruction to decode and emulate it.

We don't have to rehash what private memory is or how it is implemented.

>> This problem was laid out as having three cases:
>> 1. virtio
>> 2. x86-specific drivers
>> 3. random drivers (everything else)
>>
>> #1 could be done with paravirt
>> #2 is unspecified and unknown
>> #3 use doesn't, as far as I know, exist in TDX guests today
> 
> #2 doesn't matter from a performance point of view and there is no
> convenient place where they can be intercepted, as they are scattered
> across the tree. Patching them doesn't bring any benefit, only pain.

I'd feel a lot better if this was slightly better specified.  Even
booting with a:

	printf("rip: %lx\n", regs->rip);

in the #VE handler would give some hard data about these.  This still
feels to me like something that Sean got booting two years ago and
nobody has really reconsidered.

> #3 some customers have already declared that they will use device
> passthrough (yes, it is not safe). A CSP may want to emulate a random
> device, depending on the setup. Like a power button or something.

I'm not sure I'm totally on board with that.

But, let's try to make a coherent changelog out of that mess.

	This approach is bad for performance.  But, it has (virtually)
	no impact on the size of the kernel image and will work for a
	wide variety of drivers.  This allows TDX deployments to use
	arbitrary devices and device drivers, including virtio.  TDX
	customers have asked for the capability to use random devices in
	their deployments.

	In other words, even if all of the work was done to
	paravirtualize all x86 MMIO users and virtio, this approach
	would still be needed.  There is essentially no way to get rid
	of this code.

	This approach is functional for all in-kernel MMIO users current
	and future and does so with a minimal amount of code and kernel
	image bloat.

Does that summarize it?

>>> BUG() here makes it clear that the handler itself is buggy. Returning
>>> false and kicking in the #GP-like logic indicates that something is wrong
>>> with the code that triggered the #VE. I think it is an important distinction.
>>
>> OK, then how about a WARN_ON() which is followed by the #GP?
> 
> You folks give mixed messages. Thomas was very unhappy when I tried to add
> code that recovers from WBINVD:
> 
> https://lore.kernel.org/all/87y22uujkm.ffs@tglx
> 
> It is exactly the same scenario: kernel code is buggy and has to be fixed.
> 
> So, what is the policy?

Lately, I've tried to subscribe to the "there is NO F*CKING EXCUSE to
knowingly kill the kernel" policy[1].

You don't add a BUG_ON() unless the kernel has no meaningful way to
continue.  It's not for a "hey, that's weird..." kind of thing.  Like,
"hey, the instruction decoder looks confused, that's weird."

>> Let's say insn_decode_mmio() does something insane like:
>>
>> 	return -EINVAL;
>>
>> Should we really be killing the kernel for that?
> 
> Note that a #GP most likely kills the kernel as well. We handle in-kernel
> MMIO; there are not many chances to recover.
> 
> Is it really that big a deal?

No, not really.

But, I'd like to see one of two things:
1. Change the BUG()s to WARN()s.
2. Make it utterly clear that handle_mmio() is for handling kernel MMIO
   only.  Definitely change the naming, possibly add a check for
   user_mode().  In other words, make it even _less_ generic.

None of that should be hard.

BTW, the BUG()s made me think about how the gp_try_fixup_and_notify()
code would work for MMIO.  For instance, are there any places where
fixup might be done for MMIO?  If so, an earlier BUG() wouldn't allow
the fixup to occur.

Do we *WANT* #VE's to be exposed to the #GP fixup machinery?

1.
https://lore.kernel.org/all/CA+55aFwyNTLuZgOWMTRuabWobF27ygskuxvFd-P0n-3UNT=0Og@mail.gmail.com/

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 03/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-03-10 15:30   ` Borislav Petkov
@ 2022-03-10 21:20     ` Kirill A. Shutemov
  2022-03-10 21:48       ` Kirill A. Shutemov
  2022-03-12 10:41       ` Borislav Petkov
  0 siblings, 2 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-10 21:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On Thu, Mar 10, 2022 at 04:30:57PM +0100, Borislav Petkov wrote:
> On Wed, Mar 02, 2022 at 05:27:39PM +0300, Kirill A. Shutemov wrote:
> > From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> > 
> > Guests communicate with VMMs using hypercalls. Historically, these
> > are implemented using instructions that are known to cause VMEXITs
> > like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
> > expose the guest state to the host. This prevents the old hypercall
> > mechanisms from working. So, to communicate with VMM, TDX
> > specification defines a new instruction called TDCALL.
> > 
> > In a TDX based VM, since the VMM is an untrusted entity, an intermediary
> > layer -- TDX module -- facilitates secure communication between the host
> > and the guest. The TDX module is loaded like firmware into a special CPU
> > mode called SEAM. TDX guests communicate with the TDX module using the
> > TDCALL instruction.
> > 
> > A guest uses TDCALL to communicate with both the TDX module and VMM.
> > The value of the RAX register when executing the TDCALL instruction is
> > used to determine the TDCALL type. A variant of TDCALL used to communicate
> > with the VMM is called TDVMCALL.
> > 
> > Add generic interfaces to communicate with the TDX module and VMM
> > (using the TDCALL instruction).
> > 
> > __tdx_hypercall()    - Used by the guest to request services from the
> > 		       VMM (via TDVMCALL).
> > __tdx_module_call()  - Used to communicate with the TDX module (via
> > 		       TDCALL).
> 
> Ok, you need to fix this: this sounds to me like there are two insns:
> TDCALL and TDVMCALL. But there's only TDCALL.
> 
> And I'm not even clear on how the differentiation is done - I guess
> with %r11 which contains the VMCALL subfunction number in the
> __tdx_hypercall() case but I'm not sure.

TDVMCALL is a leaf of TDCALL. The leaf number is encoded in RAX: RAX==0 is
TDVMCALL.
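
In instruction terms it is just this (a minimal sketch, mirroring what
tdcall.S already does):

	xor %eax, %eax	/* RAX = 0: the TDVMCALL leaf (TDG.VP.VMCALL) */
	tdcall		/* same instruction; a non-zero RAX would select
			   a TDX module leaf instead */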

I'm not completely sure what has to be fixed. Make it clear that TDVMCALL
is a leaf of TDCALL? Something like this:

	__tdx_module_call()  - Used to communicate with the TDX module (via
			       TDCALL instruction).
	__tdx_hypercall()    - Used by the guest to request services from the
			       VMM (via TDVMCALL leaf of TDCALL).

?

> And when explaining this, pls put it in the comment over the function so
> that it is clear how the distinction is made.

But it's already there:

/*
 * __tdx_module_call()  - Used by TDX guests to request services from
 * the TDX module (does not include VMM services).
 *
 * Transforms function call register arguments into the TDCALL
 * register ABI.  After TDCALL operation, TDX module output is saved
 * in @out (if it is provided by the user)
 *
 ...
 */

and

/*
 * __tdx_hypercall() - Make hypercalls to a TDX VMM.
 *
 * Transforms values in function call argument struct tdx_hypercall_args @args
 * into the TDCALL register ABI. After TDCALL operation, VMM output is saved
 * back in @args.
 *
 *-------------------------------------------------------------------------
 * TD VMCALL ABI:
 *-------------------------------------------------------------------------
 *
 * Input Registers:
 *
 * RAX                 - TDCALL instruction leaf number (0 - TDG.VP.VMCALL)
 .....
 */

Hm?

> > Also define an additional wrapper _tdx_hypercall(), which adds error
> > handling support for the TDCALL failure.
> > 
> > The __tdx_module_call() and __tdx_hypercall() helper functions are
> > implemented in assembly in a .S file.  The TDCALL ABI requires
> > shuffling arguments in and out of registers, which proved to be
> > awkward with inline assembly.
> > 
> > Just like syscalls, not all TDVMCALL use cases need to use the same
> > number of argument registers. The implementation here picks the current
> > worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
> > than 4 arguments, there will end up being a few superfluous (cheap)
> > instructions. But, this approach maximizes code reuse.
> > 
> > For registers used by the TDCALL instruction, please check TDX GHCI
> > specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
> > Interface".
> > 
> > Based on previous patch by Sean Christopherson.
> > 
> > Reviewed-by: Tony Luck <tony.luck@intel.com>
> > Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/coco/Makefile        |   2 +-
> >  arch/x86/coco/tdcall.S        | 188 ++++++++++++++++++++++++++++++++++
> >  arch/x86/coco/tdx.c           |  18 ++++
> 
> Those should be
> 
> arch/x86/coco/tdx/tdcall.S
> arch/x86/coco/tdx/tdx.c
> 
> like we said:
> 
> "- confidential computing guest stuff: arch/x86/coco/{sev,tdx}"

Okay, will change. But it is not what we agreed on before:

https://lore.kernel.org/all/Yg5q742GsjCRHXZL@zn.tnic

> >  arch/x86/include/asm/tdx.h    |  27 +++++
> >  arch/x86/kernel/asm-offsets.c |  10 ++
> >  5 files changed, 244 insertions(+), 1 deletion(-)
> >  create mode 100644 arch/x86/coco/tdcall.S
> 
> ...
> 
> > +SYM_FUNC_START(__tdx_hypercall)
> > +	FRAME_BEGIN
> > +
> > +	/* Save callee-saved GPRs as mandated by the x86_64 ABI */
> > +	push %r15
> > +	push %r14
> > +	push %r13
> > +	push %r12
> > +
> > +	/* Mangle function call ABI into TDCALL ABI: */
> > +	/* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
> > +	xor %eax, %eax
> > +
> > +	/* Copy hypercall registers from arg struct: */
> > +	movq TDX_HYPERCALL_r10(%rdi), %r10
> > +	movq TDX_HYPERCALL_r11(%rdi), %r11
> > +	movq TDX_HYPERCALL_r12(%rdi), %r12
> > +	movq TDX_HYPERCALL_r13(%rdi), %r13
> > +	movq TDX_HYPERCALL_r14(%rdi), %r14
> > +	movq TDX_HYPERCALL_r15(%rdi), %r15
> > +
> > +	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
> > +
> > +	tdcall
> > +
> > +	/*
> > +	 * RAX!=0 indicates a failure of the TDVMCALL mechanism itself and that
> > +	 * something has gone horribly wrong with the TDX module.
> > +	 *
> > +	 * The return status of the hypercall operation is in a separate
> > +	 * register (in R10). Hypercall errors are a part of normal operation
> > +	 * and are handled by callers.
> > +	 */
> > +	testq %rax, %rax
> > +	jne .Lpanic
> 
> Hm, can this call a C function which does the panic so that a proper
> error message is dumped to the user so that at least she knows where the
> panic comes from?

Sure we can. But it would look somewhat clunky.
Wouldn't a backtrace be enough for this (never-to-be-seen) case?

So far the panic would look like this:

	invalid opcode: 0000 [#1] PREEMPT SMP
	CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.17.0-rc5-00181-geb9d1dde679a-dirty #1883
	RIP: 0010:__tdx_hypercall+0x6d/0x70
	Code: 00 00 74 17 4c 89 17 4c 89 5f 08 4c 89 67 10 4c 89 6f 18 4c 89 77 20 4c 89 7f 28 45 31 d2 45 31 db 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 00 55 53 41 54 41 55 41 56 41 57 48 89 a7 98 1b 00 00 48 8b
	RSP: 0000:ffffffff82803e80 EFLAGS: 00010002
	RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000fc00
	RDX: ff11000004c3d448 RSI: 0000000000000002 RDI: ffffffff82803ea8
	RBP: ffffffff8283e840 R08: 0000000000000000 R09: 000000005eee98b1
	R10: 0000000000000000 R11: 000000000000000c R12: 0000000000000000
	R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
	FS:  0000000000000000(0000) GS:ff1100001c400000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: ff1100001c9ff000 CR3: 0000000002834001 CR4: 0000000000771ef0
	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
	DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
	PKRU: 55555554
	Call Trace:
	 <TASK>
	 tdx_safe_halt+0x46/0x80
	 default_idle_call+0x4f/0x90
	 do_idle+0xeb/0x250
	 cpu_startup_entry+0x15/0x20
	 start_kernel+0x372/0x3d1
	 secondary_startup_64_no_verify+0xe5/0xeb
	 </TASK>

To me "invalid opcode" at "RIP: 0010:__tdx_hypercall+0x6d/0x70" is pretty
clear where it comes from, no?

> > +
> > +	/* TDVMCALL leaf return code is in R10 */
> > +	movq %r10, %rax
> > +
> > +	/* Copy hypercall result registers to arg struct if needed */
> > +	testq $TDX_HCALL_HAS_OUTPUT, %rsi
> > +	jz .Lout
> > +
> > +	movq %r10, TDX_HYPERCALL_r10(%rdi)
> > +	movq %r11, TDX_HYPERCALL_r11(%rdi)
> > +	movq %r12, TDX_HYPERCALL_r12(%rdi)
> > +	movq %r13, TDX_HYPERCALL_r13(%rdi)
> > +	movq %r14, TDX_HYPERCALL_r14(%rdi)
> > +	movq %r15, TDX_HYPERCALL_r15(%rdi)
> > +.Lout:
> > +	/*
> > +	 * Zero out registers exposed to the VMM to avoid speculative execution
> > +	 * with VMM-controlled values. This needs to include all registers
> > +	 * present in TDVMCALL_EXPOSE_REGS_MASK (except R12-R15). R12-R15
> > +	 * context will be restored.
> > +	 */
> > +	xor %r10d, %r10d
> > +	xor %r11d, %r11d
> > +
> > +	/* Restore callee-saved GPRs as mandated by the x86_64 ABI */
> > +	pop %r12
> > +	pop %r13
> > +	pop %r14
> > +	pop %r15
> > +
> > +	FRAME_END
> > +
> > +	retq
> > +.Lpanic:
> > +	ud2
> > +SYM_FUNC_END(__tdx_hypercall)
> 
> ...
> 
> > diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> > index 7dca52f5cfc6..0b465e7d0a2f 100644
> > --- a/arch/x86/kernel/asm-offsets.c
> > +++ b/arch/x86/kernel/asm-offsets.c
> > @@ -74,6 +74,16 @@ static void __used common(void)
> >  	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
> >  	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
> >  
> > +#ifdef CONFIG_INTEL_TDX_GUEST
> 
> Those have ifdeffery around them - why don't the TDX_MODULE_* ones need
> it too?

I will drop the #ifdef. There's no harm in generating it for a !TDX build.

> > +	BLANK();
> > +	OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_args, r10);
> > +	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_args, r11);
> > +	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_args, r12);
> > +	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_args, r13);
> > +	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_args, r14);
> > +	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_args, r15);
> > +#endif
> > +
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 03/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-03-10 21:20     ` Kirill A. Shutemov
@ 2022-03-10 21:48       ` Kirill A. Shutemov
  2022-03-15 15:56         ` Borislav Petkov
  2022-03-12 10:41       ` Borislav Petkov
  1 sibling, 1 reply; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-10 21:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On Fri, Mar 11, 2022 at 12:20:59AM +0300, Kirill A. Shutemov wrote:
> > > +	/*
> > > +	 * RAX!=0 indicates a failure of the TDVMCALL mechanism itself and that
> > > +	 * something has gone horribly wrong with the TDX module.
> > > +	 *
> > > +	 * The return status of the hypercall operation is in a separate
> > > +	 * register (in R10). Hypercall errors are a part of normal operation
> > > +	 * and are handled by callers.
> > > +	 */
> > > +	testq %rax, %rax
> > > +	jne .Lpanic
> > 
> > Hm, can this call a C function which does the panic so that a proper
> > error message is dumped to the user so that at least she knows where the
> > panic comes from?
> 
> Sure we can. But it would look somewhat clunky.

Here is how it could look. Is this what you want?

diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index f00fd3a39b64..b26eab2c3c59 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -3,6 +3,7 @@
 #include "../cpuflags.h"
 #include "../string.h"
 #include "../io.h"
+#include "error.h"

 #include <vdso/limits.h>
 #include <uapi/asm/vmx.h>
@@ -16,6 +17,11 @@ bool early_is_tdx_guest(void)
 	return tdx_guest_detected;
 }

+void __tdx_hypercall_failed(void)
+{
+	error("TDVMCALL failed. TDX module bug?");
+}
+
 static inline unsigned int tdx_io_in(int size, u16 port)
 {
 	struct tdx_hypercall_args args = {
diff --git a/arch/x86/coco/tdx/tdcall.S b/arch/x86/coco/tdx/tdcall.S
index 22832f19df2c..f39de4b01a9c 100644
--- a/arch/x86/coco/tdx/tdcall.S
+++ b/arch/x86/coco/tdx/tdcall.S
@@ -197,5 +197,5 @@ SYM_FUNC_START(__tdx_hypercall)

 	retq
 .Lpanic:
-	ud2
+	call __tdx_hypercall_failed
 SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 8e19694d33e2..29fc5941b80c 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -53,6 +53,11 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
 	return __tdx_hypercall(&args, 0);
 }

+void __tdx_hypercall_failed(void)
+{
+	panic("TDVMCALL failed. TDX module bug?");
+}
+
 /*
  * The TDG.VP.VMCALL-Instruction-execution sub-functions are defined
  * independently from but are currently matched 1:1 with VMX EXIT_REASONs.
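
For illustration, a caller of the _tdx_hypercall() wrapper above would
look roughly like this (a sketch, not part of the patch; the use of
EXIT_REASON_HLT as the function number follows the GHCI convention and
is an assumption here):

	static u64 example_halt(void)
	{
		/* unused TDVMCALL argument registers are simply zero */
		return _tdx_hypercall(EXIT_REASON_HLT, 0, 0, 0, 0);
	}
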
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO
  2022-03-10 17:53           ` Dave Hansen
@ 2022-03-11 17:18             ` Kirill A. Shutemov
  2022-03-11 17:22               ` Dave Hansen
  2022-03-11 18:01               ` Dave Hansen
  0 siblings, 2 replies; 84+ messages in thread
From: Kirill A. Shutemov @ 2022-03-11 17:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: tglx, mingo, bp, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, thomas.lendacky, brijesh.singh, x86, linux-kernel

On Thu, Mar 10, 2022 at 09:53:01AM -0800, Dave Hansen wrote:
> On 3/10/22 08:48, Kirill A. Shutemov wrote:
> I think this is good enough:
> 
>    - All guest code is expected to be in TD-private memory.  Being
>      private to the TD, VMMs have no way to access TD-private memory and
>      no way to read the instruction to decode and emulate it.

Looks good.

One remark: executing from shared memory (or walking page tables in shared
memory) triggers #PF.

> 
> We don't have to rehash what private memory is or how it is implemented.
> 
> >> This problem was laid out as having three cases:
> >> 1. virtio
> >> 2. x86-specific drivers
> >> 3. random drivers (everything else)
> >>
> >> #1 could be done with paravirt
> >> #2 is unspecified and unknown
> >> #3 use doesn't, as far as I know, exist in TDX guests today
> > 
> > #2 doesn't matter from a performance point of view and there is no
> > convenient place where they can be intercepted, as they are scattered
> > across the tree. Patching them doesn't bring any benefit, only pain.
> 
> I'd feel a lot better if this was slightly better specified.  Even
> booting with a:
> 
> 	printf("rip: %lx\n", regs->rip);
> 
> in the #VE handler would give some hard data about these.  This still
> feels to me like something that Sean got booting two years ago and
> nobody has really reconsidered.

Here is the list I see on boot. It highly depends on the QEMU setup. Any
form of device filtering will cut it down further.
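
A one-off line like this in the #VE handler is enough to collect the
list below (a sketch; exact placement and format are assumptions):

	pr_info("MMIO: %ps\n", (void *)regs->ip);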

MMIO: ahci_enable_ahci
MMIO: ahci_freeze
MMIO: ahci_init_controller
MMIO: ahci_port_resume
MMIO: ahci_postreset
MMIO: ahci_reset_controller
MMIO: ahci_save_initial_config
MMIO: ahci_scr_read
MMIO: ahci_scr_write
MMIO: ahci_set_em_messages
MMIO: ahci_start_engine
MMIO: ahci_stop_engine
MMIO: ahci_thaw
MMIO: ioapic_set_affinity
MMIO: ioread16
MMIO: ioread32
MMIO: ioread8
MMIO: iowrite16
MMIO: iowrite32
MMIO: iowrite8
MMIO: mask_ioapic_irq
MMIO: mp_irqdomain_activate
MMIO: mp_irqdomain_deactivate
MMIO: native_io_apic_read
MMIO: __pci_enable_msix_range
MMIO: pci_mmcfg_read
MMIO: pci_msi_mask_irq
MMIO: pci_msi_unmask_irq
MMIO: __pci_write_msi_msg
MMIO: restore_ioapic_entries
MMIO: startup_ioapic_irq
MMIO: update_no_reboot_bit_mem

ioread*/iowrite* comes from virtio.

> > #3 some customers have already declared that they will use device
> > passthrough (yes, it is not safe). A CSP may want to emulate a random
> > device, depending on the setup. Like a power button or something.
> 
> I'm not sure I'm totally on board with that.
> 
> But, let's try to make a coherent changelog out of that mess.
> 
> 	This approach is bad for performance.  But, it has (virtually)
> 	no impact on the size of the kernel image and will work for a
> 	wide variety of drivers.  This allows TDX deployments to use
> 	arbitrary devices and device drivers, including virtio.  TDX
> 	customers have asked for the capability to use random devices in
> 	their deployments.
> 
> 	In other words, even if all of the work was done to
> 	paravirtualize all x86 MMIO users and virtio, this approach
> 	would still be needed.  There is essentially no way to get rid
> 	of this code.
> 
> 	This approach is functional for all in-kernel MMIO users current
> 	and future and does so with a minimal amount of code and kernel
> 	image bloat.
> 
> Does that summarize it?

I will integrate it in the commit message.

> But, I'd like to see one of two things:
> 1. Change the BUG()s to WARN()s.
> 2. Make it utterly clear that handle_mmio() is for handling kernel MMIO
>    only.  Definitely change the naming, possibly add a check for
>    user_mode().  In other words, make it even _less_ generic.
> 
> None of that should be hard.

Okay, I will downgrade BUG() to WARN() and return false for user_mode(),
with a warning.
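
Roughly like this (a sketch of the direction, not the final patch; the
function name is illustrative):

	static bool handle_kernel_mmio(struct pt_regs *regs, struct ve_info *ve)
	{
		char buffer[MAX_INSN_SIZE];
		struct insn insn = {};

		/* This path handles in-kernel MMIO only */
		if (WARN_ON_ONCE(user_mode(regs)))
			return false;

		if (copy_from_kernel_nofault(buffer, (void *)regs->ip,
					     MAX_INSN_SIZE))
			return false;

		/* A confused decoder is weird, not fatal: warn, then let
		   the #GP-like logic kick in */
		if (WARN_ON_ONCE(insn_decode(&insn, buffer, MAX_INSN_SIZE,
					     INSN_MODE_64)))
			return false;

		/* ... decode the access and emulate it via TDVMCALL ... */
		return true;
	}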

> BTW, the BUG()s made me think about how the gp_try_fixup_and_notify()
> code would work for MMIO.  For instance, are there any places where
> fixup might be done for MMIO?  If so, an earlier BUG() wouldn't allow
> the fixup to occur.

I could be wrong, but I don't think we do fixups for MMIO.

> Do we *WANT* #VE's to be exposed to the #GP fixup machinery?

We need the fixup at least for MSRs.
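
For example, callers of rdmsrl_safe() depend on it (a sketch, not code
from the series):

	static bool msr_is_readable(u32 msr)
	{
		u64 val;

		/* A failed MSR-read TDVMCALL surfaces as a #VE and must
		   reach the exception fixup, so that this returns an
		   error instead of killing the kernel */
		return !rdmsrl_safe(msr, &val);
	}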

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO
  2022-03-11 17:18             ` Kirill A. Shutemov
@ 2022-03-11 17:22               ` Dave Hansen
  2022-03-11 18:01               ` Dave Hansen
  1 sibling, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-11 17:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, bp, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, thomas.lendacky, brijesh.singh, x86, linux-kernel

On 3/11/22 09:18, Kirill A. Shutemov wrote:
>> Do we *WANT* #VE's to be exposed to the #GP fixup machinery?
> We need the fixup at least for MSRs.

Could you mention that, along with the implications for the other #VE's
in the MSR patch changelog?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO
  2022-03-11 17:18             ` Kirill A. Shutemov
  2022-03-11 17:22               ` Dave Hansen
@ 2022-03-11 18:01               ` Dave Hansen
  1 sibling, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2022-03-11 18:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, bp, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, thomas.lendacky, brijesh.singh, x86, linux-kernel

On 3/11/22 09:18, Kirill A. Shutemov wrote:
> On Thu, Mar 10, 2022 at 09:53:01AM -0800, Dave Hansen wrote:
>> On 3/10/22 08:48, Kirill A. Shutemov wrote:
>> I think this is good enough:
>>
>>    - All guest code is expected to be in TD-private memory.  Being
>>      private to the TD, VMMs have no way to access TD-private memory and
>>      no way to read the instruction to decode and emulate it.
> 
> Looks good.
> 
> One remark: executing from shared memory (or walking page tables in shared
> memory) triggers #PF.

Good point.  I thought that little nugget was AMD only.  Thanks for the
reminder.

   - TDX does not allow guests to execute from shared memory.  All
     executed instructions are in TD-private memory.  Being private to
     the TD, VMMs have no way to access TD-private memory and no way to
     read the instruction to decode and emulate it.

...
>> I'd feel a lot better if this was slightly better specified.  Even
>> booting with a:
>>
>> 	printf("rip: %lx\n", regs->rip);
>>
>> in the #VE handler would give some hard data about these.  This still
>> feels to me like something that Sean got booting two years ago and
>> nobody has really reconsidered.
> 
> Here is the list I see on boot. It highly depends on the QEMU setup. Any
> form of device filtering will cut it down further.
> 
> MMIO: ahci_enable_ahci
> MMIO: ahci_freeze
> MMIO: ahci_init_controller
> MMIO: ahci_port_resume
> MMIO: ahci_postreset
> MMIO: ahci_reset_controller
> MMIO: ahci_save_initial_config
> MMIO: ahci_scr_read
> MMIO: ahci_scr_write
> MMIO: ahci_set_em_messages
> MMIO: ahci_start_engine
> MMIO: ahci_stop_engine
> MMIO: ahci_thaw

OK, so this is one of the random drivers that will probably be replaced
with virtio in practice in real TD guests.

> MMIO: ioapic_set_affinity
> MMIO: ioread16
> MMIO: ioread32
> MMIO: ioread8
> MMIO: iowrite16
> MMIO: iowrite32
> MMIO: iowrite8
> MMIO: mask_ioapic_irq
> MMIO: mp_irqdomain_activate
> MMIO: mp_irqdomain_deactivate
> MMIO: native_io_apic_read
> MMIO: __pci_enable_msix_range
> MMIO: pci_mmcfg_read
> MMIO: pci_msi_mask_irq
> MMIO: pci_msi_unmask_irq
> MMIO: __pci_write_msi_msg
> MMIO: restore_ioapic_entries
> MMIO: startup_ioapic_irq
> MMIO: update_no_reboot_bit_mem
> 
> ioread*/iowrite* comes from virtio.

I think we *actually* have some real facts to go on now instead of just
random hand waving about unknown MMIO in the depths of arch/x86.
Thanks for doing that.

Just spot-checking these, I think all of these end up going through some
->ops function pointers:

	irq_chip
	x86_apic_ops
	pci_raw_ops

That doesn't really help your case, though.  Presumably, with some
amount of work, we could paravirtualize these users.  The list of things
doesn't seem very large at all.  But, each of those things will need
TDX-specific code.
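
To make "TDX-specific code" concrete, patching just the IO-APIC read
could look something like this (a sketch; every name here except
x86_apic_ops is hypothetical):

	static unsigned int tdx_io_apic_read(unsigned int apic, unsigned int reg)
	{
		void __iomem *base = tdx_ioapic_base(apic);	/* hypothetical */

		/* The IO-APIC is index/data: select the register, then
		   read it, with both accesses issued as explicit
		   TDVMCALLs instead of trapped MMIO loads */
		tdx_mmio_write32(base + 0x00, reg);	/* hypothetical */
		return tdx_mmio_read32(base + 0x10);	/* hypothetical */
	}

	static void __init tdx_paravirt_ioapic(void)
	{
		x86_apic_ops.io_apic_read = tdx_io_apic_read;
	}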

So, do we go patching those things with TDX-specific code?  Or do we
just do this one, universal, slow #VE thing for now?

Kirill, I know what camp you are in. :)

Looking at the *actual* work that would be required, I'm a little more
on the fence and leaning toward being ok with the universal #VE.

Does anybody feel differently?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 03/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-03-10 21:20     ` Kirill A. Shutemov
  2022-03-10 21:48       ` Kirill A. Shutemov
@ 2022-03-12 10:41       ` Borislav Petkov
  1 sibling, 0 replies; 84+ messages in thread
From: Borislav Petkov @ 2022-03-12 10:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On Fri, Mar 11, 2022 at 12:20:59AM +0300, Kirill A. Shutemov wrote:
> > > A guest uses TDCALL to communicate with both the TDX module and VMM.
> > > The value of the RAX register when executing the TDCALL instruction is
> > > used to determine the TDCALL type. A variant of TDCALL used to communicate
> > > with the VMM is called TDVMCALL.
> > > 
> > > Add generic interfaces to communicate with the TDX module and VMM
> > > (using the TDCALL instruction).
> > > 
> > > __tdx_hypercall()    - Used by the guest to request services from the
> > > 		       VMM (via TDVMCALL).
> > > __tdx_module_call()  - Used to communicate with the TDX module (via
> > > 		       TDCALL).
> > 
> > Ok, you need to fix this: this sounds to me like there are two insns:
> > TDCALL and TDVMCALL. But there's only TDCALL.
> > 
> > And I'm not even clear on how the differentiation is done - I guess
> > with %r11 which contains the VMCALL subfunction number in the
> > __tdx_hypercall() case but I'm not sure.
> 
> TDVMCALL is a leaf of TDCALL. The leaf number is encoded in RAX: RAX==0 is
> TDVMCALL.
> 
> I'm not completely sure what has to be fixed. Make it clear that TDVMCALL
> is a leaf of TDCALL? Something like this:
> 
> 	__tdx_module_call()  - Used to communicate with the TDX module (via
> 			       TDCALL instruction).
> 	__tdx_hypercall()    - Used by the guest to request services from the
> 			       VMM (via TDVMCALL leaf of TDCALL).

Yes, it says above "via TDVMCALL" and "A variant of TDCALL used to
communicate with the VMM is called TDVMCALL.", and that is ambiguous as
to whether this is about two instructions or one instruction and a
modification of the same.

We write insn mnemonics in all caps, so I see "TDCALL" and go, ah ok,
that's the insn. But then when I see "TDVMCALL" I don't know what that
is. Another insn? Maybe - it is in all caps too...

> > And when explaining this, pls put it in the comment over the function so
> > that it is clear how the distinction is made.
> 
> But it's already there:

No, not that - the explanation you wrote above, that TDVMCALL is a leaf of
TDCALL. That needs to be there explicitly so that there's no confusion.

> Okay, will change. But it is not what we agreed on before:
> 
> https://lore.kernel.org/all/Yg5q742GsjCRHXZL@zn.tnic

I must've been confused by doing so many things at the same time,
sorry about that. :-\

> To me "invalid opcode" at "RIP: 0010:__tdx_hypercall+0x6d/0x70" is pretty
> clear where it comes from, no?

Probably to you and a couple more people who know how to read oops
messages. You have to think about our common users too.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5 03/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-03-10 21:48       ` Kirill A. Shutemov
@ 2022-03-15 15:56         ` Borislav Petkov
  0 siblings, 0 replies; 84+ messages in thread
From: Borislav Petkov @ 2022-03-15 15:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, thomas.lendacky,
	brijesh.singh, x86, linux-kernel

On Fri, Mar 11, 2022 at 12:48:28AM +0300, Kirill A. Shutemov wrote:
> Here is how it could look. Is this what you want?

Yap, that's better.

> diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
> index f00fd3a39b64..b26eab2c3c59 100644
> --- a/arch/x86/boot/compressed/tdx.c
> +++ b/arch/x86/boot/compressed/tdx.c
> @@ -3,6 +3,7 @@
>  #include "../cpuflags.h"
>  #include "../string.h"
>  #include "../io.h"
> +#include "error.h"
> 
>  #include <vdso/limits.h>
>  #include <uapi/asm/vmx.h>
> @@ -16,6 +17,11 @@ bool early_is_tdx_guest(void)
>  	return tdx_guest_detected;
>  }
> 
> +void __tdx_hypercall_failed(void)
> +{
> +	error("TDVMCALL failed. TDX module bug?");
> +}
> +
>  static inline unsigned int tdx_io_in(int size, u16 port)
>  {
>  	struct tdx_hypercall_args args = {


> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 8e19694d33e2..29fc5941b80c 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -53,6 +53,11 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
>  	return __tdx_hypercall(&args, 0);
>  }
> 
> +void __tdx_hypercall_failed(void)
> +{
> +	panic("TDVMCALL failed. TDX module bug?");
> +}

Btw, if there's going to be more code duplication in TDX-land, I'd
suggest doing a shared file like

arch/x86/kernel/sev-shared.c

which you can include in both kernel stages.
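
Something like this (the file name and the wrapper macro are made up,
just to show the sev-shared.c pattern):

	/* arch/x86/coco/tdx/tdx-shared.c -- built into both stages */
	void __tdx_hypercall_failed(void)
	{
		tdx_fatal_error("TDVMCALL failed. TDX module bug?");
	}

	/* arch/x86/boot/compressed/tdx.c */
	#define tdx_fatal_error(msg)	error(msg)
	#include "../../coco/tdx/tdx-shared.c"

	/* arch/x86/coco/tdx/tdx.c */
	#define tdx_fatal_error(msg)	panic("%s", msg)
	#include "tdx-shared.c"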

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCHv5.2 04/30] x86/tdx: Extend the confidential computing API to support TDX guests
  2022-03-09 23:51         ` [PATCHv5.2 " Kirill A. Shutemov
  2022-03-10  0:07           ` Dave Hansen
@ 2022-03-15 19:41           ` Borislav Petkov
  1 sibling, 0 replies; 84+ messages in thread
From: Borislav Petkov @ 2022-03-15 19:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: dave.hansen, aarcange, ak, brijesh.singh, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, linux-kernel,
	luto, mingo, pbonzini, peterz, sathyanarayanan.kuppuswamy, sdeep,
	seanjc, tglx, thomas.lendacky, tony.luck, vkuznets, wanpengli,
	x86

On Thu, Mar 10, 2022 at 02:51:21AM +0300, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> index fc1365dd927e..6529db059938 100644
> --- a/arch/x86/coco/core.c
> +++ b/arch/x86/coco/core.c
> @@ -87,9 +87,18 @@ EXPORT_SYMBOL_GPL(cc_platform_has);
>  
>  u64 cc_mkenc(u64 val)
>  {
> +	/*
> +	 * Both AMD and Intel use a bit in page table to indicate encryption

"... a bit in the page table ..."

> +	 * status of the page.
> +	 *
> +	 * - for AMD, bit *set* means the page is encrypted
> +	 * - for Intel *clear* means encrypted.
> +	 */
>  	switch (vendor) {
>  	case CC_VENDOR_AMD:
>  		return val | cc_mask;
> +	case CC_VENDOR_INTEL:
> +		return val & ~cc_mask;
>  	default:
>  		return val;
>  	}
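
Btw, the decrypted-mapping helper then reads naturally the other way
around - a sketch of what I'd expect the cc_mkdec() counterpart to look
like, not part of this hunk:

	u64 cc_mkdec(u64 val)
	{
		switch (vendor) {
		case CC_VENDOR_AMD:
			return val & ~cc_mask;
		case CC_VENDOR_INTEL:
			return val | cc_mask;
		default:
			return val;
		}
	}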

With that fixed:

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2022-03-15 19:43 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-02 14:27 [PATCHv5 00/30] TDX Guest: TDX core support Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 01/30] x86/tdx: Detect running as a TDX guest in early boot Kirill A. Shutemov
2022-03-04 15:43   ` Borislav Petkov
2022-03-04 15:47     ` Dave Hansen
2022-03-04 16:02       ` Borislav Petkov
2022-03-07 22:24         ` [PATCHv5.1 " Kirill A. Shutemov
2022-03-09 18:22           ` Borislav Petkov
2022-03-02 14:27 ` [PATCHv5 02/30] x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers Kirill A. Shutemov
2022-03-08 19:56   ` Dave Hansen
2022-03-10 12:32   ` Borislav Petkov
2022-03-10 14:44     ` Kirill A. Shutemov
2022-03-10 14:51       ` Borislav Petkov
2022-03-02 14:27 ` [PATCHv5 03/30] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions Kirill A. Shutemov
2022-03-08 20:03   ` Dave Hansen
2022-03-10 15:30   ` Borislav Petkov
2022-03-10 21:20     ` Kirill A. Shutemov
2022-03-10 21:48       ` Kirill A. Shutemov
2022-03-15 15:56         ` Borislav Petkov
2022-03-12 10:41       ` Borislav Petkov
2022-03-02 14:27 ` [PATCHv5 04/30] x86/tdx: Extend the confidential computing API to support TDX guests Kirill A. Shutemov
2022-03-08 20:17   ` Dave Hansen
2022-03-09 16:01     ` [PATCHv5.1 " Kirill A. Shutemov
2022-03-09 18:36       ` Dave Hansen
2022-03-09 23:51         ` [PATCHv5.2 " Kirill A. Shutemov
2022-03-10  0:07           ` Dave Hansen
2022-03-15 19:41           ` Borislav Petkov
2022-03-02 14:27 ` [PATCHv5 05/30] x86/tdx: Exclude shared bit from __PHYSICAL_MASK Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 06/30] x86/traps: Refactor exc_general_protection() Kirill A. Shutemov
2022-03-08 20:18   ` Dave Hansen
2022-03-02 14:27 ` [PATCHv5 07/30] x86/traps: Add #VE support for TDX guest Kirill A. Shutemov
2022-03-08 20:29   ` Dave Hansen
2022-03-02 14:27 ` [PATCHv5 08/30] x86/tdx: Add HLT support for TDX guests Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 09/30] x86/tdx: Add MSR " Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 10/30] x86/tdx: Handle CPUID via #VE Kirill A. Shutemov
2022-03-08 20:33   ` Dave Hansen
2022-03-09 16:15     ` [PATCH] " Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO Kirill A. Shutemov
2022-03-08 21:26   ` Dave Hansen
2022-03-10  0:51     ` Kirill A. Shutemov
2022-03-10  1:06       ` Dave Hansen
2022-03-10 16:48         ` Kirill A. Shutemov
2022-03-10 17:53           ` Dave Hansen
2022-03-11 17:18             ` Kirill A. Shutemov
2022-03-11 17:22               ` Dave Hansen
2022-03-11 18:01               ` Dave Hansen
2022-03-02 14:27 ` [PATCHv5 12/30] x86/tdx: Detect TDX at early kernel decompression time Kirill A. Shutemov
2022-03-07 22:27   ` [PATCHv5.1 " Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 13/30] x86: Adjust types used in port I/O helpers Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 14/30] x86: Consolidate " Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 15/30] x86/boot: Port I/O: allow to hook up alternative helpers Kirill A. Shutemov
2022-03-02 17:42   ` Josh Poimboeuf
2022-03-02 19:41     ` Dave Hansen
2022-03-02 20:02       ` Josh Poimboeuf
2022-03-02 14:27 ` [PATCHv5 16/30] x86/boot: Port I/O: add decompression-time support for TDX Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 17/30] x86/tdx: Port I/O: add runtime hypercalls Kirill A. Shutemov
2022-03-08 21:30   ` Dave Hansen
2022-03-02 14:27 ` [PATCHv5 18/30] x86/tdx: Port I/O: add early boot support Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 19/30] x86/tdx: Wire up KVM hypercalls Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 20/30] x86/boot: Add a trampoline for booting APs via firmware handoff Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 21/30] x86/acpi, x86/boot: Add multiprocessor wake-up support Kirill A. Shutemov
2022-03-02 14:27 ` [PATCHv5 22/30] x86/boot: Set CR0.NE early and keep it set during the boot Kirill A. Shutemov
2022-03-08 21:37   ` Dave Hansen
2022-03-02 14:27 ` [PATCHv5 23/30] x86/boot: Avoid #VE during boot for TDX platforms Kirill A. Shutemov
2022-03-07  9:29   ` Xiaoyao Li
2022-03-07 22:33     ` Kirill A. Shutemov
2022-03-08  1:19       ` Xiaoyao Li
2022-03-08 16:41         ` Kirill A. Shutemov
2022-03-07 22:36     ` [PATCHv5.1 " Kirill A. Shutemov
2022-03-02 14:28 ` [PATCHv5 24/30] x86/topology: Disable CPU online/offline control for TDX guests Kirill A. Shutemov
2022-03-02 14:28 ` [PATCHv5 25/30] x86/tdx: Make pages shared in ioremap() Kirill A. Shutemov
2022-03-08 22:02   ` Dave Hansen
2022-03-02 14:28 ` [PATCHv5 26/30] x86/mm/cpa: Add support for TDX shared memory Kirill A. Shutemov
2022-03-09 19:44   ` Dave Hansen
2022-03-02 14:28 ` [PATCHv5 27/30] x86/kvm: Use bounce buffers for TD guest Kirill A. Shutemov
2022-03-09 20:07   ` Dave Hansen
2022-03-10 14:29     ` Tom Lendacky
2022-03-10 14:51       ` Christoph Hellwig
2022-03-02 14:28 ` [PATCHv5 28/30] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kirill A. Shutemov
2022-03-09 20:39   ` Dave Hansen
2022-03-02 14:28 ` [PATCHv5 29/30] ACPICA: Avoid cache flush inside virtual machines Kirill A. Shutemov
2022-03-02 16:13   ` Dan Williams
2022-03-09 20:56   ` Dave Hansen
2022-03-02 14:28 ` [PATCHv5 30/30] Documentation/x86: Document TDX kernel architecture Kirill A. Shutemov
2022-03-09 21:49   ` Dave Hansen
