* [PATCH v12 00/22] TDX host kernel support
@ 2023-06-26 14:12 Kai Huang
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

Intel Trust Domain Extensions (TDX) protects guest VMs from a malicious
host and certain physical attacks.  TDX specs are available in [1].

This series is the initial support to enable TDX with minimal code to
allow KVM to create and run TDX guests.  KVM support for TDX is being
developed separately[2].  A new "userspace inaccessible memfd" approach
to support TDX private memory is also being developed[3].  KVM will
only support the new "userspace inaccessible memfd" as TDX guest memory.

Also, the first few generations of TDX hardware have an erratum[4] and
require additional handling.

This series doesn't aim to support all functionality or to resolve
everything perfectly.  Further optimizations will be posted as
follow-ups once this initial TDX support is upstreamed.

(For memory hotplug, sorry for broadcasting widely, but I cc'ed
linux-mm@kvack.org following Kirill's suggestion so MM experts can also
help to provide comments.)

Hi Dave/Kirill/Tony/David and all,

Thanks for your review on the previous versions.  I'd appreciate your
review of this version, and any tags if the patches look good to you.
Thanks!

----- Changelog history: ------

- v11 -> v12:
 - Addressed comments in v11 from Dave/Kirill/David and others.
 - Collected review tags from Dave/Kirill/David and others.
 - Split the SEAMCALL infrastructure patch into 2 patches for easier
   review.
 - Added one more patch to keep TDMRs when module initialization is
   successful, for easier review.

 v11: https://lore.kernel.org/lkml/cover.1685887183.git.kai.huang@intel.com/T/

- v10 -> v11:

 - Addressed comments in v10.
 - Added patches to handle TDX "partial write machine check" erratum.
 - Added a new patch to handle running out of entropy in common code.
 - Fixed a bug in kexec() support.

 v10: https://lore.kernel.org/kvm/cover.1678111292.git.kai.huang@intel.com/

- v9 -> v10:

 - Changed the per-cpu initialization handling:
   - Gave up "ensuring all online cpus are TDX-runnable when the TDX
     module is initialized"; instead, just provide two basic functions,
     tdx_enable() and tdx_cpu_enable(), and let the user of TDX make
     sure tdx_cpu_enable() has been done successfully on a cpu before
     using that cpu for TDX.
   - Thus, moved per-cpu initialization out of tdx_enable().  Now
     tdx_enable() just assumes VMXON and tdx_cpu_enable() have been done
     on all online cpus before calling it.
   - Merged the tdx_enable() skeleton patch and per-cpu initialization
     patch together to tell a better story.
   - Moved "SEAMCALL infrastructure" patch before the tdx_enable() patch.

 v9: https://lore.kernel.org/lkml/cover.1676286526.git.kai.huang@intel.com/

- v8 -> v9:

 - Added patches to handle TDH.SYS.INIT and TDH.SYS.LP.INIT back.
 - For other changes, please refer to the changelog history in
   individual patches.

 v8: https://lore.kernel.org/lkml/cover.1670566861.git.kai.huang@intel.com/

- v7 -> v8:

 - 200+ LOC removed (from 1800+ -> 1600+).
 - Removed patches to do TDH.SYS.INIT and TDH.SYS.LP.INIT
   (Dave/Peter/Thomas).
 - Removed patch to shut down TDX module (Sean).
 - For memory hotplug, changed to reject non-TDX memory from
   arch_add_memory() to memory_notifier (Dan/David).
 - Simplified the "skeletion patch" as a result of removing
   TDH.SYS.LP.INIT patch.
 - Refined changelog/comments for most of the patches (to tell a better
   story, remove silly comments, etc) (Dave).
 - Added a new 'struct tdmr_info_list', and changed all TDMR-related
   patches to use it (Dave).
 - Effectively merged patch "Reserve TDX module global KeyID" and
   "Configure TDX module with TDMRs and global KeyID", and removed the
   static variable 'tdx_global_keyid', following Dave's suggestion on
   making tdx_sysinfo a local variable.
 - For detailed changes please see individual patch changelog history.

 v7: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v6 -> v7:
  - Added memory hotplug support.
  - Changed when the list of "TDX-usable" memory regions is chosen,
    from kernel boot time to TDX module initialization time.
  - Addressed comments received in previous versions. (Andi/Dave).
  - Improved the commit message and the comments of the kexec() support
    patch, and the patch now handles returning PAMTs back to the kernel
    when TDX module initialization fails. Please also see the "kexec()"
    section below.
  - Changed the documentation patch accordingly.
  - For all others please see individual patch changelog history.

 v6: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v5 -> v6:

  - Removed ACPI CPU/memory hotplug patches. (Intel internal discussion)
  - Removed patch to disable driver-managed memory hotplug (Intel
    internal discussion).
  - Added one patch to introduce enum type for TDX supported page size
    level to replace the hard-coded values in TDX guest code (Dave).
  - Added one patch to make TDX depend on X2APIC being enabled (Dave).
  - Added one patch to build all boot-time present memory regions as TDX
    memory during kernel boot.
  - Added Reviewed-by from others to some patches.
  - For all others please see individual patch changelog history.

 v5: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v4 -> v5:

  This is essentially a resend of v4.  Sorry I forgot to consult
  get_maintainer.pl when sending out v4, so I forgot to add the
  linux-acpi and linux-mm mailing lists and the relevant people for 4
  new patches.

  There are also very minor code and commit message updates from v4:

  - Rebased to latest tip/x86/tdx.
  - Fixed a checkpatch issue that I missed in v4.
  - Removed an obsoleted comment that I missed in patch 6.
  - Very minor update to the commit message of patch 12.

  For other changes to individual patches since v3, please refer to the
  changelog history of individual patches (I just used v3 -> v5 since
  there's basically no code change in v4).

 v4: https://lore.kernel.org/lkml/98c84c31d8f062a0b50a69ef4d3188bc259f2af2.1654025431.git.kai.huang@intel.com/T/

- v3 -> v4 (addressed Dave's comments, and other comments from others):

 - Simplified SEAMRR and TDX keyID detection.
 - Added patches to handle ACPI CPU hotplug.
 - Added patches to handle ACPI memory hotplug and driver managed memory
   hotplug.
 - Removed tdx_detect(); use a single tdx_init() instead.
 - Removed detecting TDX module via P-SEAMLDR.
 - Changed from using e820 to using memblock to convert system RAM to TDX
   memory.
 - Excluded legacy PMEM from TDX memory.
 - Removed the patch adding a boot-time command line to disable TDX.
 - Addressed comments for other individual patches (please see individual
   patches).
 - Improved the documentation patch based on the new implementation.

 v3: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- V2 -> v3:

 - Addressed comments from Isaku.
  - Fixed a memory leak and an unnecessary function argument in the
    patch to configure the key for the global keyid (patch 17).
  - Slightly enhanced the patch to get TDX module and CMR
    information (patch 09).
  - Fixed an unintended change in the patch to allocate PAMT (patch 13).
 - Addressed comments from Kevin:
  - Slight improvement to the commit message of patch 03.
 - Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
   seamrr_enabled() (patch 04).
 - Changed the documentation patch to add TDX host kernel support
   materials to Documentation/x86/tdx.rst together with the TDX guest
   stuff, instead of in a standalone file (patch 21).
 - Very minor improvement in commit messages.

 v2: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- RFC (v1) -> v2:
  - Rebased to Kirill's latest TDX guest code.
  - Fixed two issues that are related to finding all RAM memory regions
    based on e820.
  - Minor improvement on comments and commit messages.

 v1: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

== Background ==

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM)
and a new isolated range pointed to by the SEAM Range Register (SEAMRR).
A CPU-attested software module called 'the TDX module' runs in this
isolated range as a trusted hypervisor to create/run protected VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs.  TDX reserves part of the MKTME
KeyID space as TDX private KeyIDs, which are only accessible within
SEAM mode.

TDX is different from AMD SEV/SEV-ES/SEV-SNP, which uses a dedicated
secure processor to provide crypto-protection.  The firmware running on
that secure processor plays a role similar to the TDX module's.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

Before being able to manage TD guests, the TDX module must be loaded
and properly initialized.  This series assumes the TDX module is loaded
by BIOS before the kernel boots.

How to initialize the TDX module is described in the TDX module 1.0
specification, chapter "13. Intel TDX Module Lifecycle: Enumeration,
Initialization and Shutdown".

== Design Considerations ==

1. Initialize the TDX module at runtime

There are basically two ways the TDX module could be initialized: either
in early boot, or at runtime before the first TDX guest is run.  This
series implements the runtime initialization.

Also, TDX requires a per-cpu initialization SEAMCALL to be done before
making any SEAMCALL on that cpu.

This series adds two functions: tdx_cpu_enable() and tdx_enable() to do
per-cpu initialization and module initialization respectively.
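
As a rough illustration, the intended calling convention for a user
like KVM looks like the sketch below (a minimal sketch only; the helper
do_vmxon_and_tdx_cpu_enable() is hypothetical and not part of this
series):

	static int enable_tdx(void)
	{
		int err = 0;

		/* Keep the set of online cpus stable */
		cpus_read_lock();

		/*
		 * Hypothetical helper: do VMXON then tdx_cpu_enable()
		 * on each online cpu, recording the first error.
		 */
		on_each_cpu(do_vmxon_and_tdx_cpu_enable, &err, 1);
		if (!err)
			err = tdx_enable();

		cpus_read_unlock();

		return err;
	}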

2. CPU hotplug

TDX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS
should never support hotpluggable CPU devices and/or deliver ACPI CPU
hotplug events to the kernel.  This series doesn't handle physical
(ACPI) CPU hotplug at all but depends on the BIOS to behave correctly.

Also, tdx_cpu_enable() will simply return an error for any hot-added
cpu if something insane happens.

Note TDX works with CPU logical online/offline, thus this series still
allows logical CPU online/offline.

3. Kernel policy on TDX memory

The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
indicate which memory regions are TDX-capable.  The TDX architecture
allows the VMM to designate specific convertible memory regions as usable
for TDX private memory.

The initial support of TDX guests will only allocate TDX private memory
from the global page allocator.  This series chooses to designate _all_
system RAM managed by the core-mm at the time the TDX module is
initialized as TDX memory, to guarantee all pages in the page allocator
are TDX pages.
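
A minimal sketch of that policy, assuming a hypothetical helper
add_tdx_memblock() that records one region (the actual patch may differ
in details):

	/* Walk all memblock memory and record it as TDX-usable. */
	static int build_tdx_memlist(struct list_head *tmb_list)
	{
		unsigned long start_pfn, end_pfn;
		int i, ret;

		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn,
				       &end_pfn, NULL) {
			/* Hypothetical: remember [start_pfn, end_pfn) */
			ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
			if (ret)
				return ret;
		}

		return 0;
	}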

4. Memory Hotplug

After the kernel passes all "TDX-usable" memory regions to the TDX
module, the set of "TDX-usable" memory regions is fixed for the
module's runtime.  No more "TDX-usable" memory can be added to the TDX
module after that.

To achieve above "to guarantee all pages in the page allocator are TDX
pages", this series simply choose to reject any non-TDX-usable memory in
memory hotplug.
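
A sketch of that rejection, where is_tdx_memory() is an assumed helper
that checks a pfn range against the list built at module initialization:

	static int tdx_memory_notifier(struct notifier_block *nb,
				       unsigned long action, void *v)
	{
		struct memory_notify *mn = v;

		if (action != MEM_GOING_ONLINE)
			return NOTIFY_OK;

		/*
		 * Reject onlining any memory that was not passed to
		 * the TDX module, so the page allocator stays all-TDX.
		 */
		return is_tdx_memory(mn->start_pfn,
				     mn->start_pfn + mn->nr_pages) ?
		       NOTIFY_OK : NOTIFY_BAD;
	}

	static struct notifier_block tdx_memory_nb = {
		.notifier_call = tdx_memory_notifier,
	};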

5. Physical Memory Hotplug

Note TDX assumes convertible memory is always physically present during
the machine's runtime.  A non-buggy BIOS should never support
hot-removal of any convertible memory.  This implementation doesn't
handle ACPI memory removal but depends on the BIOS to behave correctly.

Also, if something insane really happens, item 4) above makes sure
either TDX cannot be enabled or hot-added memory will be rejected after
TDX gets enabled.

6. Kexec()

Similar to AMD's SME, in kexec() the kernel needs to flush dirty
cachelines of TDX private memory; otherwise they may silently corrupt
the new kernel.
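
Conceptually, that boils down to something like the sketch below (the
actual patch wires this into the existing kexec/stop-cpus paths):

	/* Flush caches on this cpu before jumping to the new kernel. */
	static void tdx_kexec_flush_cache(void)
	{
		if (platform_tdx_enabled())
			native_wbinvd();
	}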

7. TDX erratum

The first few generations of TDX hardware have an erratum.  A partial
write to a TDX private memory cacheline will silently "poison" the
line.  Subsequent reads will consume the poison and generate a machine
check.

The fast warm reset reboot doesn't reset TDX private memory.  With this
erratum, all TDX private pages need to be converted back to normal
before a fast warm reset reboot or booting to the new kernel in kexec().
Otherwise, the new kernel may get unexpected machine checks.

Under normal conditions, triggering the erratum in Linux requires some
kind of kernel bug involving relatively exotic memory writes to TDX
private memory, and it manifests via spurious-looking machine checks
when reading the affected memory.  The machine check handler is
improved to deal with such machine checks.
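
On affected platforms the conversion back to normal can be done with
full-cacheline writes, e.g. via MOVDIR64B.  A sketch, assuming the page
is in the direct map and the cpu supports MOVDIR64B (the actual patch
does this on the kexec path):

	/* Overwrite one private page with zeros, 64 bytes at a time. */
	static void reset_tdx_private_page(void *page)
	{
		const void *zero = page_address(ZERO_PAGE(0));
		unsigned long i;

		for (i = 0; i < PAGE_SIZE; i += 64)
			movdir64b(page + i, zero);

		/* Make sure the direct writes are globally visible. */
		mb();
	}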


[1]: TDX specs
https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html

[2]: KVM TDX basic feature support
https://lore.kernel.org/kvm/cover.1685333727.git.isaku.yamahata@intel.com/T/#t

[3]: KVM: mm: fd-based approach for supporting KVM
https://lore.kernel.org/kvm/20221202061347.1070246-1-chao.p.peng@linux.intel.com/

[4]: TDX erratum
https://cdrdv2.intel.com/v1/dl/getContent/772415?explicitVersion=true




Kai Huang (22):
  x86/tdx: Define TDX supported page sizes as macros
  x86/virt/tdx: Detect TDX during kernel boot
  x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
  x86/cpu: Detect TDX partial write machine check erratum
  x86/virt/tdx: Add SEAMCALL infrastructure
  x86/virt/tdx: Handle SEAMCALL running out of entropy error
  x86/virt/tdx: Add skeleton to enable TDX on demand
  x86/virt/tdx: Get information about TDX module and TDX-capable memory
  x86/virt/tdx: Use all system memory when initializing TDX module as
    TDX memory
  x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX
    memory regions
  x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
  x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  x86/virt/tdx: Designate reserved areas for all TDMRs
  x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
  x86/virt/tdx: Configure global KeyID on all packages
  x86/virt/tdx: Initialize all TDMRs
  x86/kexec: Flush cache of TDX private memory
  x86/virt/tdx: Keep TDMRs when module initialization is successful
  x86/kexec(): Reset TDX private memory on platforms with TDX erratum
  x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  x86/mce: Improve error log of kernel space TDX #MC due to erratum
  Documentation/x86: Add documentation for TDX host support

 Documentation/arch/x86/tdx.rst     |  189 +++-
 arch/x86/Kconfig                   |   15 +
 arch/x86/Makefile                  |    2 +
 arch/x86/coco/tdx/tdx.c            |    6 +-
 arch/x86/include/asm/cpufeatures.h |    1 +
 arch/x86/include/asm/msr-index.h   |    3 +
 arch/x86/include/asm/tdx.h         |   26 +
 arch/x86/kernel/cpu/intel.c        |   17 +
 arch/x86/kernel/cpu/mce/core.c     |   33 +
 arch/x86/kernel/machine_kexec_64.c |    9 +
 arch/x86/kernel/process.c          |    7 +-
 arch/x86/kernel/reboot.c           |   15 +
 arch/x86/kernel/setup.c            |    2 +
 arch/x86/virt/Makefile             |    2 +
 arch/x86/virt/vmx/Makefile         |    2 +
 arch/x86/virt/vmx/tdx/Makefile     |    2 +
 arch/x86/virt/vmx/tdx/seamcall.S   |   52 +
 arch/x86/virt/vmx/tdx/tdx.c        | 1542 ++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h        |  151 +++
 arch/x86/virt/vmx/tdx/tdxcall.S    |   19 +-
 20 files changed, 2078 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h


base-commit: 94142c9d1bdf1c18027a42758ceb6bdd59a92012
-- 
2.40.1



* [PATCH v12 01/22] x86/tdx: Define TDX supported page sizes as macros
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

TDX supports 4K, 2M and 1G page sizes.  The corresponding values are
defined by the TDX module spec and used as TDX module ABI.  Currently,
they are used in try_accept_one() when the TDX guest tries to accept a
page.  However, try_accept_one() currently uses hard-coded magic values.

Define TDX supported page sizes as macros and get rid of the hard-coded
values in try_accept_one().  TDX host support will need to use them too.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---

v11 -> v12:
 - No change.

v10 -> v11:
 - Added David's Reviewed-by.

v9 -> v10:
 - No change.

v8 -> v9:
 - Added Dave's Reviewed-by

v7 -> v8:
 - Improved the comment of TDX supported page sizes macros (Dave)

v6 -> v7:
 - Removed the helper to convert kernel page level to TDX page level.
 - Changed to use macro to define TDX supported page sizes.

---
 arch/x86/coco/tdx/tdx.c    | 6 +++---
 arch/x86/include/asm/tdx.h | 5 +++++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 5b8056f6c83f..b34851297ae5 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -755,13 +755,13 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
 	 */
 	switch (pg_level) {
 	case PG_LEVEL_4K:
-		page_size = 0;
+		page_size = TDX_PS_4K;
 		break;
 	case PG_LEVEL_2M:
-		page_size = 1;
+		page_size = TDX_PS_2M;
 		break;
 	case PG_LEVEL_1G:
-		page_size = 2;
+		page_size = TDX_PS_1G;
 		break;
 	default:
 		return false;
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 28d889c9aa16..25fd6070dc0b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -20,6 +20,11 @@
 
 #ifndef __ASSEMBLY__
 
+/* TDX supported page sizes from the TDX module ABI. */
+#define TDX_PS_4K	0
+#define TDX_PS_2M	1
+#define TDX_PS_1G	2
+
 /*
  * Used to gather the output registers values of the TDCALL and SEAMCALL
  * instructions when requesting services from the TDX module.
-- 
2.40.1



* [PATCH v12 02/22] x86/virt/tdx: Detect TDX during kernel boot
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

Intel Trust Domain Extensions (TDX) protects guest VMs from a malicious
host and certain physical attacks.  A CPU-attested software module
called 'the TDX module' runs inside a new isolated memory range as a
trusted hypervisor to manage and run protected VMs.

Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME.  The memory encryption hardware underpinning MKTME is also
used for Intel TDX.  TDX ends up "stealing" some of the physical address
space from the MKTME architecture for crypto-protection to VMs.  The
BIOS is responsible for partitioning the "KeyID" space between legacy
MKTME and TDX.  The KeyIDs reserved for TDX are called 'TDX private
KeyIDs' or 'TDX KeyIDs' for short.

During machine boot, TDX microcode verifies that the BIOS programmed the
TDX private KeyIDs consistently and correctly across all CPU
packages.  The MSRs are locked in this state after verification.  This
is why MSR_IA32_MKTME_KEYID_PARTITIONING gets used for TDX enumeration:
it indicates not just that the hardware supports TDX, but that all the
boot-time security checks passed.

The TDX module is expected to be loaded by the BIOS when it enables TDX,
but the kernel needs to properly initialize it before it can be used to
create and run any TDX guests.  The TDX module will be initialized by
the KVM subsystem when KVM wants to use TDX.

Add a new early_initcall(tdx_init) to detect TDX by detecting TDX
private KeyIDs.  Also add a function to report whether TDX is enabled by
the BIOS.  Similar to AMD SME, kexec() will use it to determine whether
cache flush is needed.

The TDX module itself requires one TDX KeyID as the 'TDX global KeyID'
to protect its metadata.  Each TDX guest also needs a TDX KeyID for its
own protection.  Just use the first TDX KeyID as the global KeyID and
leave the rest for TDX guests.  If no TDX KeyID is left for TDX guests,
disable TDX as initializing the TDX module alone is useless.

To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support.  Add a new Kconfig option CONFIG_INTEL_TDX_HOST
to opt in to TDX host kernel support (to distinguish it from TDX guest
kernel support).  So far only KVM uses TDX.  Make the new config option
depend on KVM_INTEL.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---

v11 -> v12:
 - Improve setting up guest's TDX keyID range (David)
   - ++tdx_keyid_start -> tdx_keyid_start + 1
   - --nr_tdx_keyids -> nr_tdx_keyids - 1 
 - 'return -ENODEV' instead of 'goto no_tdx' (Sathy)
 - pr_info() -> pr_err() (Isaku)
 - Added tags from Isaku/David

v10 -> v11 (David):
 - "host kernel" -> "the host kernel"
 - "protected VM" -> "confidential VM".
 - Moved setting tdx_global_keyid to the end of tdx_init().

v9 -> v10:
 - No change.

v8 -> v9:
 - Moved MSR macro from local tdx.h to <asm/msr-index.h> (Dave).
 - Moved reserving the TDX global KeyID from later patch to here.
 - Changed 'tdx_keyid_start' and 'nr_tdx_keyids' to
   'tdx_guest_keyid_start' and 'tdx_nr_guest_keyids' to represent KeyIDs
   can be used by guest. (Dave)
 - Slight changelog update according to above changes.

v7 -> v8: (address Dave's comments)
 - Improved changelog:
    - "KVM user" -> "The TDX module will be initialized by KVM when ..."
    - Changed "tdx_int" part to "Just say what this patch is doing"
    - Fixed the last sentence of "kexec()" paragraph
  - detect_tdx() -> record_keyid_partitioning()
  - Improved how to calculate tdx_keyid_start.
  - tdx_keyid_num -> nr_tdx_keyids.
  - Improved dmesg printing.
  - Add comment to clear_tdx().

v6 -> v7:
 - No change.

v5 -> v6:
 - Removed SEAMRR detection to make code simpler.
 - Removed the 'default N' in the KVM_TDX_HOST Kconfig (Kirill).
 - Changed to use 'obj-y' in arch/x86/virt/vmx/tdx/Makefile (Kirill).

---
 arch/x86/Kconfig                 | 12 +++++
 arch/x86/Makefile                |  2 +
 arch/x86/include/asm/msr-index.h |  3 ++
 arch/x86/include/asm/tdx.h       |  7 +++
 arch/x86/virt/Makefile           |  2 +
 arch/x86/virt/vmx/Makefile       |  2 +
 arch/x86/virt/vmx/tdx/Makefile   |  2 +
 arch/x86/virt/vmx/tdx/tdx.c      | 90 ++++++++++++++++++++++++++++++++
 8 files changed, 120 insertions(+)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..191587f75810 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1952,6 +1952,18 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config INTEL_TDX_HOST
+	bool "Intel Trust Domain Extensions (TDX) host support"
+	depends on CPU_SUP_INTEL
+	depends on X86_64
+	depends on KVM_INTEL
+	help
+	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
+	  host and certain physical attacks.  This option enables necessary TDX
+	  support in the host kernel to run confidential VMs.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index b39975977c03..ec0e71d8fa30 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -252,6 +252,8 @@ archheaders:
 
 libs-y  += arch/x86/lib/
 
+core-y += arch/x86/virt/
+
 # drivers-y are linked after core-y
 drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
 drivers-$(CONFIG_PCI)            += arch/x86/pci/
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3aedae61af4f..6d8f15b1552c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -523,6 +523,9 @@
 #define MSR_RELOAD_PMC0			0x000014c1
 #define MSR_RELOAD_FIXED_CTR0		0x00001309
 
+/* KeyID partitioning between MKTME and TDX */
+#define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
+
 /*
  * AMD64 MSRs. Not complete. See the architecture manual for a more
  * complete list.
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 25fd6070dc0b..4dfe2e794411 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -94,5 +94,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 	return -ENODEV;
 }
 #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
+
+#ifdef CONFIG_INTEL_TDX_HOST
+bool platform_tdx_enabled(void);
+#else	/* !CONFIG_INTEL_TDX_HOST */
+static inline bool platform_tdx_enabled(void) { return false; }
+#endif	/* CONFIG_INTEL_TDX_HOST */
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
new file mode 100644
index 000000000000..1e36502cd738
--- /dev/null
+++ b/arch/x86/virt/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y	+= vmx/
diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
new file mode 100644
index 000000000000..feebda21d793
--- /dev/null
+++ b/arch/x86/virt/vmx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx/
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
new file mode 100644
index 000000000000..93ca8b73e1f1
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += tdx.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
new file mode 100644
index 000000000000..908590e85749
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -0,0 +1,90 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2023 Intel Corporation.
+ *
+ * Intel Trust Domain Extensions (TDX) support
+ */
+
+#define pr_fmt(fmt)	"tdx: " fmt
+
+#include <linux/types.h>
+#include <linux/cache.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/printk.h>
+#include <asm/msr-index.h>
+#include <asm/msr.h>
+#include <asm/tdx.h>
+
+static u32 tdx_global_keyid __ro_after_init;
+static u32 tdx_guest_keyid_start __ro_after_init;
+static u32 tdx_nr_guest_keyids __ro_after_init;
+
+static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
+					    u32 *nr_tdx_keyids)
+{
+	u32 _nr_mktme_keyids, _tdx_keyid_start, _nr_tdx_keyids;
+	int ret;
+
+	/*
+	 * IA32_MKTME_KEYID_PARTITIONING:
+	 *   Bit [31:0]:	Number of MKTME KeyIDs.
+	 *   Bit [63:32]:	Number of TDX private KeyIDs.
+	 */
+	ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &_nr_mktme_keyids,
+			&_nr_tdx_keyids);
+	if (ret)
+		return -ENODEV;
+
+	if (!_nr_tdx_keyids)
+		return -ENODEV;
+
+	/* TDX KeyIDs start after the last MKTME KeyID. */
+	_tdx_keyid_start = _nr_mktme_keyids + 1;
+
+	*tdx_keyid_start = _tdx_keyid_start;
+	*nr_tdx_keyids = _nr_tdx_keyids;
+
+	return 0;
+}
+
+static int __init tdx_init(void)
+{
+	u32 tdx_keyid_start, nr_tdx_keyids;
+	int err;
+
+	err = record_keyid_partitioning(&tdx_keyid_start, &nr_tdx_keyids);
+	if (err)
+		return err;
+
+	pr_info("BIOS enabled: private KeyID range [%u, %u)\n",
+			tdx_keyid_start, tdx_keyid_start + nr_tdx_keyids);
+
+	/*
+	 * The TDX module itself requires one 'global KeyID' to protect
+	 * its metadata.  If there's only one TDX KeyID, there won't be
+	 * any left for TDX guests thus there's no point to enable TDX
+	 * at all.
+	 */
+	if (nr_tdx_keyids < 2) {
+		pr_err("initialization failed: too few private KeyIDs available.\n");
+		return -ENODEV;
+	}
+
+	/*
+	 * Just use the first TDX KeyID as the 'global KeyID' and
+	 * leave the rest for TDX guests.
+	 */
+	tdx_global_keyid = tdx_keyid_start;
+	tdx_guest_keyid_start = tdx_keyid_start + 1;
+	tdx_nr_guest_keyids = nr_tdx_keyids - 1;
+
+	return 0;
+}
+early_initcall(tdx_init);
+
+/* Return whether the BIOS has enabled TDX */
+bool platform_tdx_enabled(void)
+{
+	return !!tdx_global_keyid;
+}
-- 
2.40.1



* [PATCH v12 03/22] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

TDX-capable platforms are locked to X2APIC mode and cannot fall back to
the legacy xAPIC mode when TDX is enabled by the BIOS.  TDX host support
therefore requires x2APIC.  Make INTEL_TDX_HOST depend on X86_X2APIC.

Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v11 -> v12:
 - Added Kirill's Reviewed-by.

v10 -> v11:
 - Added David's Reviewed-by.

v9 -> v10:
 - No change.

v8 -> v9:
 - Added Dave's Reviewed-by.

v7 -> v8: (Dave)
 - Only make INTEL_TDX_HOST depend on X86_X2APIC but removed other code
 - Rewrote the changelog.

v6 -> v7:
 - Changed to use "Link" for the two lore links to get rid of checkpatch
   warning.


---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 191587f75810..f0f3f1a2c8e0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1957,6 +1957,7 @@ config INTEL_TDX_HOST
 	depends on CPU_SUP_INTEL
 	depends on X86_64
 	depends on KVM_INTEL
+	depends on X86_X2APIC
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
-- 
2.40.1



* [PATCH v12 04/22] x86/cpu: Detect TDX partial write machine check erratum
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

TDX memory has integrity and confidentiality protections.  Violations of
this integrity protection are supposed to only affect TDX operations and
are never supposed to affect the host kernel itself.  In other words,
the host kernel should never, itself, see machine checks induced by the
TDX integrity hardware.

Alas, the first few generations of TDX hardware have an erratum.  A
partial write to a TDX private memory cacheline will silently "poison"
the line.  Subsequent reads will consume the poison and generate a
machine check.  According to the TDX hardware spec, neither of these
things should have happened.

Virtually all kernel memory access operations happen in full
cachelines.  In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.

This problem is triggered by "partial" writes, where a write
transaction of less than a cacheline lands at the memory controller.
The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
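
For illustration only (not code from this series), a non-temporal store
like the below bypasses the normal read-modify-write of a full
cacheline and can land at the memory controller as a partial write:

	/* 8-byte non-temporal store: does not fill a whole cacheline. */
	static inline void nt_store64(void *dst, unsigned long val)
	{
		asm volatile("movnti %1, %0"
			     : "=m" (*(unsigned long *)dst)
			     : "r" (val));
	}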

With this erratum, there are additional things that need to be done.
Similar to other CPU bugs, use a CPU bug bit to indicate this erratum,
and detect this erratum during early boot.  Note this bug reflects the
hardware, thus it is detected regardless of whether the kernel is built
with TDX support or not.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v11 -> v12:
 - Added Kirill's tag
 - Changed to detect the erratum in early_init_intel() (Kirill)

v10 -> v11:
 - New patch


---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/kernel/cpu/intel.c        | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index cb8ca46213be..dc8701f8d88b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -483,5 +483,6 @@
 #define X86_BUG_RETBLEED		X86_BUG(27) /* CPU is affected by RETBleed */
 #define X86_BUG_EIBRS_PBRSB		X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
 #define X86_BUG_SMT_RSB			X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */
+#define X86_BUG_TDX_PW_MCE		X86_BUG(30) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 1c4639588ff9..e6c3107adc15 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -358,6 +358,21 @@ int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type)
 }
 EXPORT_SYMBOL_GPL(intel_microcode_sanity_check);
 
+static void check_tdx_erratum(struct cpuinfo_x86 *c)
+{
+	/*
+	 * These CPUs have an erratum.  A partial write from non-TD
+	 * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
+	 * private memory poisons that memory, and a subsequent read of
+	 * that memory triggers #MC.
+	 */
+	switch (c->x86_model) {
+	case INTEL_FAM6_SAPPHIRERAPIDS_X:
+	case INTEL_FAM6_EMERALDRAPIDS_X:
+		setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
+	}
+}
+
 static void early_init_intel(struct cpuinfo_x86 *c)
 {
 	u64 misc_enable;
@@ -509,6 +524,8 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 	 */
 	if (detect_extended_topology_early(c) < 0)
 		detect_ht_early(c);
+
+	check_tdx_erratum(c);
 }
 
 static void bsp_init_intel(struct cpuinfo_x86 *c)
-- 
2.40.1



* [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM).  This
mode runs only the TDX module itself or other code to load the TDX
module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.  The TDX
module establishes a new SEAMCALL ABI which allows the host to
initialize the module and to manage VMs.

Add infrastructure to make SEAMCALLs.  The SEAMCALL ABI is very similar
to the TDCALL ABI and leverages much of the TDCALL infrastructure.

Also add a wrapper function around SEAMCALL to convert the SEAMCALL
error code to a kernel error code, and print out the SEAMCALL error
code to help the user understand what went wrong.
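
As a hypothetical usage example of the wrapper (the TDH_SYS_INFO leaf
and its arguments here are illustrative, based on later patches in this
series):

	static int example_sys_info(u64 sysinfo_pa, u64 cmr_array_pa)
	{
		struct tdx_module_output out;
		u64 sret;
		int ret;

		/* Leaf-specific inputs go in rcx/rdx/r8/r9. */
		ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
			       cmr_array_pa, MAX_CMRS, &sret, &out);
		if (ret)
			return ret;	/* error already printed by seamcall() */

		return 0;
	}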

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
---

v11 -> v12:
 - Moved _ASM_EXT_TABLE() for #UD/#GP to a later patch for better patch
   review, and removed related part from changelog.
 - Minor code changes in seamcall() (David)
 - Added Isaku's tag

v10 -> v11:
 - No update

v9 -> v10:
 - Made the TDX_SEAMCALL_{GP|UD} error codes unconditional rather than
   defining them only when INTEL_TDX_HOST is enabled. (Dave)
 - Slightly improved changelog to explain why add assembly code to handle
   #UD and #GP.

v8 -> v9:
 - Changed patch title (Dave).
 - Enhanced seamcall() to include the cpu id in the error message when
   SEAMCALL fails.

v7 -> v8:
 - Improved changelog (Dave):
   - Trim down some sentences (Dave).
   - Removed __seamcall() and seamcall() function name and changed
     accordingly (Dave).
   - Improved the sentence explaining why to handle #GP (Dave).
 - Added code to print out error message in seamcall(), following
   the idea that tdx_enable() to return universal error and print out
   error message to make clear what's going wrong (Dave).  Also mention
   this in changelog.

v6 -> v7:
 - No change.

v5 -> v6:
 - Added code to handle #UD and #GP (Dave).
 - Moved the seamcall() wrapper function to this patch, and used a
   temporary __always_unused to avoid compile warning (Dave).

- v3 -> v5 (no feedback on v4):
 - Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the
   SEAMCALL itself fails.
 - Improve the changelog.


---
 arch/x86/virt/vmx/tdx/Makefile   |  2 +-
 arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.c      | 42 ++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      | 10 ++++++
 4 files changed, 105 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index 93ca8b73e1f1..38d534f2c113 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,2 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-y += tdx.o
+obj-y += tdx.o seamcall.o
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
new file mode 100644
index 000000000000..f81be6b9c133
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#include "tdxcall.S"
+
+/*
+ * __seamcall() - Host-side interface functions to SEAM software module
+ *		  (the P-SEAMLDR or the TDX module).
+ *
+ * Transform function call register arguments into the SEAMCALL register
+ * ABI.  Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails,
+ * or the completion status of the SEAMCALL leaf function.  Additional
+ * output operands are saved in @out (if it is provided by the caller).
+ *
+ *-------------------------------------------------------------------------
+ * SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX                 - SEAMCALL Leaf number.
+ * RCX,RDX,R8-R9       - SEAMCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX                 - SEAMCALL completion status code.
+ * RCX,RDX,R8-R11      - SEAMCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __seamcall() function ABI:
+ *
+ * @fn  (RDI)          - SEAMCALL Leaf number, moved to RAX
+ * @rcx (RSI)          - Input parameter 1, moved to RCX
+ * @rdx (RDX)          - Input parameter 2, moved to RDX
+ * @r8  (RCX)          - Input parameter 3, moved to R8
+ * @r9  (R8)           - Input parameter 4, moved to R9
+ *
+ * @out (R9)           - struct tdx_module_output pointer
+ *			 stored temporarily in R12 (not
+ *			 used by the P-SEAMLDR or the TDX
+ *			 module). It can be NULL.
+ *
+ * Return (via RAX) the completion status of the SEAMCALL, or
+ * TDX_SEAMCALL_VMFAILINVALID.
+ */
+SYM_FUNC_START(__seamcall)
+	FRAME_BEGIN
+	TDX_MODULE_CALL host=1
+	FRAME_END
+	RET
+SYM_FUNC_END(__seamcall)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 908590e85749..f8233cba5931 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -12,14 +12,56 @@
 #include <linux/init.h>
 #include <linux/errno.h>
 #include <linux/printk.h>
+#include <linux/smp.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/tdx.h>
+#include "tdx.h"
 
 static u32 tdx_global_keyid __ro_after_init;
 static u32 tdx_guest_keyid_start __ro_after_init;
 static u32 tdx_nr_guest_keyids __ro_after_init;
 
+/*
+ * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
+ * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
+ * leaf function return code and the additional output respectively if
+ * not NULL.
+ */
+static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+				    u64 *seamcall_ret,
+				    struct tdx_module_output *out)
+{
+	u64 sret;
+	int cpu;
+
+	/* Need a stable CPU id for printing error message */
+	cpu = get_cpu();
+	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
+	put_cpu();
+
+	/* Save SEAMCALL return code if the caller wants it */
+	if (seamcall_ret)
+		*seamcall_ret = sret;
+
+	switch (sret) {
+	case 0:
+		/* SEAMCALL was successful */
+		return 0;
+	case TDX_SEAMCALL_VMFAILINVALID:
+		pr_err_once("module is not loaded.\n");
+		return -ENODEV;
+	default:
+		pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
+				cpu, fn, sret);
+		if (out)
+			pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
+					out->rcx, out->rdx, out->r8,
+					out->r9, out->r10, out->r11);
+		return -EIO;
+	}
+}
+
 static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
 					    u32 *nr_tdx_keyids)
 {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
new file mode 100644
index 000000000000..48ad1a1ba737
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_VIRT_TDX_H
+#define _X86_VIRT_TDX_H
+
+#include <linux/types.h>
+
+struct tdx_module_output;
+u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+	       struct tdx_module_output *out);
+#endif
-- 
2.40.1



* [PATCH v12 06/22] x86/virt/tdx: Handle SEAMCALL running out of entropy error
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

Some SEAMCALLs use the RDRAND hardware and can fail for the same reasons
as RDRAND.  Use the kernel RDRAND retry logic for them.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---

v11 -> v12:
 - Added tags from Dave/Kirill/David.
 - Improved changelog (Dave).
 - Slight code improvement (David)
   - Initialize retry directly when declaring it.
   - Simplify comment around mimic rdrand_long().

v10 -> v11:
 - New patch

---
 arch/x86/virt/vmx/tdx/tdx.c | 16 ++++++++++++++--
 arch/x86/virt/vmx/tdx/tdx.h | 17 +++++++++++++++++
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f8233cba5931..141d12376c4d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -15,6 +15,7 @@
 #include <linux/smp.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
+#include <asm/archrandom.h>
 #include <asm/tdx.h>
 #include "tdx.h"
 
@@ -32,12 +33,23 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 				    u64 *seamcall_ret,
 				    struct tdx_module_output *out)
 {
+	int cpu, retry = RDRAND_RETRY_LOOPS;
 	u64 sret;
-	int cpu;
 
 	/* Need a stable CPU id for printing error message */
 	cpu = get_cpu();
-	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
+
+	/*
+	 * Certain SEAMCALL leaf functions may return error due to
+	 * running out of entropy, in which case the SEAMCALL should
+	 * be retried.  Handle this in SEAMCALL common function.
+	 *
+	 * Mimic rdrand_long() retry behavior.
+	 */
+	do {
+		sret = __seamcall(fn, rcx, rdx, r8, r9, out);
+	} while (sret == TDX_RND_NO_ENTROPY && --retry);
+
 	put_cpu();
 
 	/* Save SEAMCALL return code if the caller wants it */
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 48ad1a1ba737..55dbb1b8c971 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -4,6 +4,23 @@
 
 #include <linux/types.h>
 
+/*
+ * This file contains both macros and data structures defined by the TDX
+ * architecture and Linux defined software data structures and functions.
+ * The two should not be mixed together for better readability.  The
+ * architectural definitions come first.
+ */
+
+/*
+ * TDX SEAMCALL error codes
+ */
+#define TDX_RND_NO_ENTROPY	0x8000020300000000ULL
+
+/*
+ * Do not put any hardware-defined TDX structure representations below
+ * this comment!
+ */
+
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	       struct tdx_module_output *out);
-- 
2.40.1



* [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

To enable TDX the kernel needs to initialize TDX from two perspectives:
1) Do a set of SEAMCALLs to initialize the TDX module to make it ready
to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL
on a logical cpu before the kernel makes any other SEAMCALLs on that
cpu (including those involved in module initialization and in running
TDX guests).

The TDX module can be initialized only once in its lifetime.  Instead
of always initializing it at boot time, this implementation chooses an
"on demand" approach, deferring TDX initialization until there is a
real need (e.g. when requested by KVM).  This approach has the
following pros:

1) It avoids consuming the memory that must be allocated by the kernel
and given to the TDX module as metadata (~1/256th of the TDX-usable
memory), and also saves the CPU cycles of initializing the TDX module
(and the metadata) when TDX is not used at all.

2) The TDX module design allows it to be updated while the system is
running.  The update procedure shares quite a few steps with this "on
demand" initialization mechanism.  The hope is that much of "on demand"
mechanism can be shared with a future "update" mechanism.  A boot-time
TDX module implementation would not be able to share much code with the
update mechanism.

3) Making SEAMCALL requires VMX to be enabled.  Currently, only the KVM
code mucks with VMX enabling.  If the TDX module were to be initialized
separately from KVM (like at boot), the boot code would need to be
taught how to muck with VMX enabling and KVM would need to be taught how
to cope with that.  Making KVM itself responsible for TDX initialization
lets the rest of the kernel stay blissfully unaware of VMX.

Similar to module initialization, also make the per-cpu initialization
"on demand" as it also depends on VMX being enabled.

Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX
module and enable TDX on local cpu respectively.  For now tdx_enable()
is a placeholder.  The TODO list will be pared down as functionality is
added.

Export both tdx_cpu_enable() and tdx_enable() for KVM use.

In tdx_enable() use a state machine protected by a mutex to make sure
the initialization will only be done once, as tdx_enable() can be
called multiple times (i.e. the KVM module can be reloaded) and may be
called concurrently by other kernel components in the future.

The per-cpu initialization on each cpu can only be done once during the
module's life time.  Use a per-cpu variable to track its status to make
sure it is only done once in tdx_cpu_enable().

Also, a SEAMCALL to do TDX module global initialization must be done
once on any logical cpu before any per-cpu initialization SEAMCALL.  Do
it inside tdx_cpu_enable() too (if it hasn't been done).

tdx_enable() can potentially invoke SEAMCALLs on any online cpus.  The
per-cpu initialization must be done before those SEAMCALLs are invoked
on some cpu.  To keep things simple, in tdx_cpu_enable(), always do the
per-cpu initialization regardless of whether the TDX module has been
initialized or not.  And in tdx_enable(), don't call tdx_cpu_enable()
but assume the caller has disabled CPU hotplug, done VMXON and
tdx_cpu_enable() on all online cpus before calling tdx_enable().
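
The serialization described above boils down to roughly the sketch
below (matching the mutex and status variable added by this patch;
__tdx_enable() does the actual module initialization):

	int tdx_enable(void)
	{
		int ret;

		if (!platform_tdx_enabled())
			return -ENODEV;

		lockdep_assert_cpus_held();

		mutex_lock(&tdx_module_lock);

		switch (tdx_module_status) {
		case TDX_MODULE_UNKNOWN:
			/* First caller: do the real initialization. */
			ret = __tdx_enable();
			break;
		case TDX_MODULE_INITIALIZED:
			/* Already initialized, nothing to do. */
			ret = 0;
			break;
		default:
			/* Failed in a previous attempt; don't retry. */
			ret = -EINVAL;
			break;
		}

		mutex_unlock(&tdx_module_lock);

		return ret;
	}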

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v11 -> v12:
 - Simplified TDX module global init and lp init status tracking (David).
 - Added comment around try_init_module_global() for using
   raw_spin_lock() (Dave).
 - Added one sentence to changelog to explain why to expose tdx_enable()
   and tdx_cpu_enable() (Dave).
 - Simplified comments around tdx_enable() and tdx_cpu_enable() to use
   lockdep_assert_*() instead. (Dave)
 - Removed redundant "TDX" in error message (Dave).

v10 -> v11:
 - Return -ENODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off.
 - Return the actual error code for tdx_enable() instead of -EINVAL.
 - Added Isaku's Reviewed-by.

v9 -> v10:
 - Merged the patch to handle per-cpu initialization to this patch to
   tell the story better.
 - Changed how to handle the per-cpu initialization to only provide a
   tdx_cpu_enable() function to let the user of TDX to do it when the
   user wants to run TDX code on a certain cpu.
 - Changed tdx_enable() to not call cpus_read_lock() explicitly, but
   call lockdep_assert_cpus_held() to assume the caller has done that.
 - Improved comments around tdx_enable() and tdx_cpu_enable().
 - Improved changelog to tell the story better accordingly.

v8 -> v9:
 - Removed detailed TODO list in the changelog (Dave).
 - Added back steps to do module global initialization and per-cpu
   initialization in the TODO list comment.
 - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h

v7 -> v8:
 - Refined changelog (Dave).
 - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
 - Add a "TODO list" comment in init_tdx_module() to list all steps of
   initializing the TDX Module to tell the story (Dave).
 - Made tdx_enable() universally return -EINVAL, and removed nonsense
   comments (Dave).
 - Simplified __tdx_enable() to only handle success or failure.
 - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR
 - Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
 - Improved comments (Dave).
 - Pointed out 'tdx_module_status' is software thing (Dave).

v6 -> v7:
 - No change.

v5 -> v6:
 - Added code to set status to TDX_MODULE_NONE if TDX module is not
   loaded (Chao)
 - Added Chao's Reviewed-by.
 - Improved comments around cpus_read_lock().

- v3->v5 (no feedback on v4):
 - Removed the check that SEAMRR and TDX KeyID have been detected on
   all present cpus.
 - Removed tdx_detect().
 - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
   hotplug lock and return early with error message.
 - Improved dmesg printing for TDX module detection and initialization.


---
 arch/x86/include/asm/tdx.h  |   4 +
 arch/x86/virt/vmx/tdx/tdx.c | 162 ++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  13 +++
 3 files changed, 179 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4dfe2e794411..d8226a50c58c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -97,8 +97,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 
 #ifdef CONFIG_INTEL_TDX_HOST
 bool platform_tdx_enabled(void);
+int tdx_cpu_enable(void);
+int tdx_enable(void);
 #else	/* !CONFIG_INTEL_TDX_HOST */
 static inline bool platform_tdx_enabled(void) { return false; }
+static inline int tdx_cpu_enable(void) { return -ENODEV; }
+static inline int tdx_enable(void)  { return -ENODEV; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 141d12376c4d..29ca18f66d61 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,10 @@
 #include <linux/errno.h>
 #include <linux/printk.h>
 #include <linux/smp.h>
+#include <linux/cpu.h>
+#include <linux/spinlock.h>
+#include <linux/percpu-defs.h>
+#include <linux/mutex.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/archrandom.h>
@@ -23,6 +27,13 @@ static u32 tdx_global_keyid __ro_after_init;
 static u32 tdx_guest_keyid_start __ro_after_init;
 static u32 tdx_nr_guest_keyids __ro_after_init;
 
+static bool tdx_global_initialized;
+static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);
+static DEFINE_PER_CPU(bool, tdx_lp_initialized);
+
+static enum tdx_module_status_t tdx_module_status;
+static DEFINE_MUTEX(tdx_module_lock);
+
 /*
  * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
  * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
@@ -74,6 +85,157 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	}
 }
 
+/*
+ * Do the module global initialization if not done yet.
+ * It's always called with interrupts and preemption disabled.
+ */
+static int try_init_module_global(void)
+{
+	unsigned long flags;
+	int ret;
+
+	/*
+	 * The TDX module global initialization only needs to be done
+	 * once on any cpu.
+	 */
+	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
+
+	if (tdx_global_initialized) {
+		ret = 0;
+		goto out;
+	}
+
+	/* All '0's are just unused parameters. */
+	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
+	if (!ret)
+		tdx_global_initialized = true;
+out:
+	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
+
+	return ret;
+}
+
+/**
+ * tdx_cpu_enable - Enable TDX on local cpu
+ *
+ * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
+ * global initialization SEAMCALL if not done) on the local cpu to make
+ * this cpu ready to run any other SEAMCALLs.
+ *
+ * Call this function with preemption disabled.
+ *
+ * Return 0 on success, otherwise errors.
+ */
+int tdx_cpu_enable(void)
+{
+	int ret;
+
+	if (!platform_tdx_enabled())
+		return -ENODEV;
+
+	lockdep_assert_preemption_disabled();
+
+	/* Already done */
+	if (__this_cpu_read(tdx_lp_initialized))
+		return 0;
+
+	/*
+	 * The TDX module global initialization is the very first step
+	 * to enable TDX.  It needs to be done first (if it hasn't been
+	 * done) before the per-cpu initialization.
+	 */
+	ret = try_init_module_global();
+	if (ret)
+		return ret;
+
+	/* All '0's are just unused parameters */
+	ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
+	if (ret)
+		return ret;
+
+	__this_cpu_write(tdx_lp_initialized, true);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(tdx_cpu_enable);
+
+static int init_tdx_module(void)
+{
+	/*
+	 * TODO:
+	 *
+	 *  - Get TDX module information and TDX-capable memory regions.
+	 *  - Build the list of TDX-usable memory regions.
+	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
+	 *    all TDX-usable memory regions.
+	 *  - Configure the TDMRs and the global KeyID to the TDX module.
+	 *  - Configure the global KeyID on all packages.
+	 *  - Initialize all TDMRs.
+	 *
+	 *  Return error before all steps are done.
+	 */
+	return -EINVAL;
+}
+
+static int __tdx_enable(void)
+{
+	int ret;
+
+	ret = init_tdx_module();
+	if (ret) {
+		pr_err("module initialization failed (%d)\n", ret);
+		tdx_module_status = TDX_MODULE_ERROR;
+		return ret;
+	}
+
+	pr_info("module initialized.\n");
+	tdx_module_status = TDX_MODULE_INITIALIZED;
+
+	return 0;
+}
+
+/**
+ * tdx_enable - Enable TDX module to make it ready to run TDX guests
+ *
+ * This function assumes the caller has: 1) held the read lock of the CPU
+ * hotplug lock to prevent any new cpu from becoming online; 2) done both VMXON
+ * and tdx_cpu_enable() on all online cpus.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return 0 if TDX is enabled successfully, otherwise error.
+ */
+int tdx_enable(void)
+{
+	int ret;
+
+	if (!platform_tdx_enabled())
+		return -ENODEV;
+
+	lockdep_assert_cpus_held();
+
+	mutex_lock(&tdx_module_lock);
+
+	switch (tdx_module_status) {
+	case TDX_MODULE_UNKNOWN:
+		ret = __tdx_enable();
+		break;
+	case TDX_MODULE_INITIALIZED:
+		/* Already initialized, great, tell the caller. */
+		ret = 0;
+		break;
+	default:
+		/* Failed to initialize in the previous attempts */
+		ret = -EINVAL;
+		break;
+	}
+
+	mutex_unlock(&tdx_module_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_enable);
+
 static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
 					    u32 *nr_tdx_keyids)
 {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 55dbb1b8c971..9fb46033c852 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -16,11 +16,24 @@
  */
 #define TDX_RND_NO_ENTROPY	0x8000020300000000ULL
 
+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_INIT		33
+#define TDH_SYS_LP_INIT		35
+
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!
  */
 
+/* Kernel defined TDX module status during module initialization. */
+enum tdx_module_status_t {
+	TDX_MODULE_UNKNOWN,
+	TDX_MODULE_INITIALIZED,
+	TDX_MODULE_ERROR
+};
+
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	       struct tdx_module_output *out);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (6 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-06-27  9:51   ` kirill.shutemov
  2023-06-28 14:10   ` Peter Zijlstra
  2023-06-26 14:12 ` [PATCH v12 09/22] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
                   ` (14 subsequent siblings)
  22 siblings, 2 replies; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

Start to transition out the "multi-steps" to initialize the TDX module.

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR).  During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.

CMRs tell the kernel which memory is TDX compatible.  The kernel takes
CMRs (plus a little more metadata) and constructs "TD Memory Regions"
(TDMRs).  TDMRs let the kernel grant TDX protections to some or all of
the CMR areas.

The TDX module also reports, in the structure 'tdsysinfo_struct', the
information necessary for the kernel to build TDMRs and run TDX guests.
The list of CMRs, along with the TDX module information, is available
to the kernel by querying the TDX module.

As a preparation to construct TDMRs, get the TDX module information and
the list of CMRs.  Print out CMRs to help the user decode which memory
regions are TDX convertible.

The 'tdsysinfo_struct' is fairly large (1024 bytes) and contains a lot
of info about the TDX module.  Fully define the entire structure, but
only use the fields necessary to build the TDMRs and pr_info() some
basics about the module.  The rest of the fields will get used by KVM.
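
For reference, the query itself boils down to a single TDH.SYS.INFO
SEAMCALL.  A condensed sketch of what the diff below does (arguments
follow the __seamcall() register order):

	/*
	 * RCX/RDX: PA and size of the TDSYSINFO_STRUCT buffer.
	 * R8/R9:   PA and max entries of the CMR array buffer.
	 * On success, out.r9 holds the number of CMR entries the
	 * TDX module actually wrote.
	 */
	ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
			cmr_array_pa, MAX_CMRS, NULL, &out);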

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
---

v11 -> v12:
 - Changed to use dynamic allocation for TDSYSINFO_STRUCT and CMR array
   (Kirill).
 - Keep SEAMCALL leaf macro definitions in order (Kirill)
 - Removed is_cmr_empty() and open-coded the check directly (David)
 - 'atribute' -> 'attribute' (David)

v10 -> v11:
 - No change.

v9 -> v10:
 - Added back "start to transit out..." as now per-cpu init has been
   moved out from tdx_enable().

v8 -> v9:
 - Removed "start to trransit out ..." part in changelog since this patch
   is no longer the first step anymore.
 - Changed to declare 'tdsysinfo' and 'cmr_array' as local static, and
   changed changelog accordingly (Dave).
 - Improved changelog to explain why to declare 'tdsysinfo_struct' in
   full but only use a few members of it (Dave).

v7 -> v8: (Dave)
 - Improved changelog to tell this is the first patch to transit out the
   "multi-steps" init_tdx_module().
 - Removed all CMR check/trim code but to depend on later SEAMCALL.
 - Variable 'vertical alignment' in print TDX module information.
 - Added DECLARE_PADDED_STRUCT() for padded structure.
 - Made tdx_sysinfo and tdx_cmr_array[] to be function local variable
   (and rename them accordingly), and added -Wframe-larger-than=4096 flag
   to silence the build warning.

v6 -> v7:
 - Simplified the check of CMRs due to the fact that TDX actually
   verifies CMRs (that are passed by the BIOS) before enabling TDX.
 - Changed the function name from check_cmrs() -> trim_empty_cmrs().
 - Added CMR page aligned check so that later patch can just get the PFN
   using ">> PAGE_SHIFT".

v5 -> v6:
 - Added to also print TDX module's attribute (Isaku).
 - Removed all arguments in tdx_get_sysinfo() to use static variables
   of 'tdx_sysinfo' and 'tdx_cmr_array' directly as they are all used
   directly in other functions in later patches.
 - Added Isaku's Reviewed-by.

- v3 -> v5 (no feedback on v4):
 - Renamed sanitize_cmrs() to check_cmrs().
 - Removed unnecessary sanity check against tdx_sysinfo and tdx_cmr_array
   actual size returned by TDH.SYS.INFO.
 - Changed -EFAULT to -EINVAL in couple places.
 - Added comments around tdx_sysinfo and tdx_cmr_array saying they are
   used by TDH.SYS.INFO ABI.
 - Changed to pass 'tdx_sysinfo' and 'tdx_cmr_array' as function
   arguments in tdx_get_sysinfo().
 - Changed to only print BIOS-CMR when check_cmrs() fails.


---
 arch/x86/virt/vmx/tdx/tdx.c | 79 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h | 60 ++++++++++++++++++++++++++++
 2 files changed, 137 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 29ca18f66d61..a2129cbe056e 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -20,6 +20,7 @@
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/archrandom.h>
+#include <asm/page.h>
 #include <asm/tdx.h>
 #include "tdx.h"
 
@@ -159,12 +160,79 @@ int tdx_cpu_enable(void)
 }
 EXPORT_SYMBOL_GPL(tdx_cpu_enable);
 
+static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs)
+{
+	int i;
+
+	for (i = 0; i < nr_cmrs; i++) {
+		struct cmr_info *cmr = &cmr_array[i];
+
+		/*
+		 * The array of CMRs reported via TDH.SYS.INFO can
+		 * contain tail empty CMRs.  Don't print them.
+		 */
+		if (!cmr->size)
+			break;
+
+		pr_info("CMR: [0x%llx, 0x%llx)\n", cmr->base,
+				cmr->base + cmr->size);
+	}
+}
+
+static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
+			   struct cmr_info *cmr_array)
+{
+	struct tdx_module_output out;
+	u64 sysinfo_pa, cmr_array_pa;
+	int ret;
+
+	sysinfo_pa = __pa(sysinfo);
+	cmr_array_pa = __pa(cmr_array);
+	ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
+			cmr_array_pa, MAX_CMRS, NULL, &out);
+	if (ret)
+		return ret;
+
+	pr_info("TDX module: attributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
+		sysinfo->attributes,	sysinfo->vendor_id,
+		sysinfo->major_version, sysinfo->minor_version,
+		sysinfo->build_date,	sysinfo->build_num);
+
+	/* R9 contains the actual entries written to the CMR array. */
+	print_cmrs(cmr_array, out.r9);
+
+	return 0;
+}
+
 static int init_tdx_module(void)
 {
+	struct tdsysinfo_struct *sysinfo;
+	struct cmr_info *cmr_array;
+	int ret;
+
+	/*
+	 * Get the TDSYSINFO_STRUCT and CMRs from the TDX module.
+	 *
+	 * The buffers of the TDSYSINFO_STRUCT and the CMR array passed
+	 * to the TDX module must be 1024-byte and 512-byte aligned,
+	 * respectively.  Allocate one page to accommodate them both and
+	 * also meet those alignment requirements.
+	 */
+	sysinfo = (struct tdsysinfo_struct *)__get_free_page(GFP_KERNEL);
+	if (!sysinfo)
+		return -ENOMEM;
+	cmr_array = (struct cmr_info *)((unsigned long)sysinfo + PAGE_SIZE / 2);
+
+	BUILD_BUG_ON(PAGE_SIZE / 2 < TDSYSINFO_STRUCT_SIZE);
+	BUILD_BUG_ON(PAGE_SIZE / 2 < sizeof(struct cmr_info) * MAX_CMRS);
+
+	ret = tdx_get_sysinfo(sysinfo, cmr_array);
+	if (ret)
+		goto out;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Get TDX module information and TDX-capable memory regions.
 	 *  - Build the list of TDX-usable memory regions.
 	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
 	 *    all TDX-usable memory regions.
@@ -174,7 +242,14 @@ static int init_tdx_module(void)
 	 *
 	 *  Return error before all steps are done.
 	 */
-	return -EINVAL;
+	ret = -EINVAL;
+out:
+	/*
+	 * For now both @sysinfo and @cmr_array are only used during
+	 * module initialization, so always free them.
+	 */
+	free_page((unsigned long)sysinfo);
+	return ret;
 }
 
 static int __tdx_enable(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 9fb46033c852..8ab2d40971ea 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -3,6 +3,8 @@
 #define _X86_VIRT_TDX_H
 
 #include <linux/types.h>
+#include <linux/stddef.h>
+#include <linux/compiler_attributes.h>
 
 /*
  * This file contains both macros and data structures defined by the TDX
@@ -19,9 +21,67 @@
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
 
+struct cmr_info {
+	u64	base;
+	u64	size;
+} __packed;
+
+#define MAX_CMRS	32
+
+struct cpuid_config {
+	u32	leaf;
+	u32	sub_leaf;
+	u32	eax;
+	u32	ebx;
+	u32	ecx;
+	u32	edx;
+} __packed;
+
+#define TDSYSINFO_STRUCT_SIZE		1024
+
+/*
+ * The size of this structure itself is flexible.  The actual structure
+ * passed to TDH.SYS.INFO must be padded to 1024 bytes and be 1024-byte
+ * aligned.
+ */
+struct tdsysinfo_struct {
+	/* TDX-SEAM Module Info */
+	u32	attributes;
+	u32	vendor_id;
+	u32	build_date;
+	u16	build_num;
+	u16	minor_version;
+	u16	major_version;
+	u8	reserved0[14];
+	/* Memory Info */
+	u16	max_tdmrs;
+	u16	max_reserved_per_tdmr;
+	u16	pamt_entry_size;
+	u8	reserved1[10];
+	/* Control Struct Info */
+	u16	tdcs_base_size;
+	u8	reserved2[2];
+	u16	tdvps_base_size;
+	u8	tdvps_xfam_dependent_size;
+	u8	reserved3[9];
+	/* TD Capabilities */
+	u64	attributes_fixed0;
+	u64	attributes_fixed1;
+	u64	xfam_fixed0;
+	u64	xfam_fixed1;
+	u8	reserved4[32];
+	u32	num_cpuid_config;
+	/*
+	 * The actual number of CPUID_CONFIG depends on above
+	 * 'num_cpuid_config'.
+	 */
+	DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
+} __packed;
+
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 09/22] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (7 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-06-28 14:17   ` Peter Zijlstra
  2023-07-11 11:38   ` David Hildenbrand
  2023-06-26 14:12 ` [PATCH v12 10/22] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Kai Huang
                   ` (13 subsequent siblings)
  22 siblings, 2 replies; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

As a step of initializing the TDX module, the kernel needs to tell the
TDX module which memory regions can be used by the TDX module as TDX
guest memory.

TDX reports a list of "Convertible Memory Region" (CMR) to tell the
kernel which memory is TDX compatible.  The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
the TDX module.  Once this is done, those "TDX-usable" memory regions
are fixed during module's lifetime.

To keep things simple, assume that all TDX-protected memory will come
from the page allocator.  Make sure all pages in the page allocator
*are* TDX-usable memory.

As TDX-usable memory is a fixed configuration, take a snapshot of the
memory configuration from memblocks at the time of module initialization
(memblocks are modified on memory hotplug).  This snapshot is used to
enable TDX support for *this* memory configuration only.  Use a memory
hotplug notifier to ensure that no other RAM can be added outside of
this configuration.
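
In essence, the notifier only needs to reject MEM_GOING_ONLINE events
for ranges outside the snapshot.  A minimal sketch of the idea (the
complete version, including the TDX-disabled case, is in the diff
below):

	static int tdx_memory_notifier(struct notifier_block *nb,
				       unsigned long action, void *v)
	{
		struct memory_notify *mn = v;

		if (action != MEM_GOING_ONLINE)
			return NOTIFY_OK;

		/* Reject new memory outside the TDX memory snapshot */
		return is_tdx_memory(mn->start_pfn,
				mn->start_pfn + mn->nr_pages) ?
			NOTIFY_OK : NOTIFY_BAD;
	}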

For this approach to work, all memblock memory regions at the time of
module initialization must be TDX convertible memory; otherwise module
initialization will fail in a later SEAMCALL when passing those regions
to the module.  This approach works when all boot-time "system RAM" is
TDX convertible memory and no non-TDX-convertible memory is hot-added
to the core-mm before module initialization.

For instance, on the first generation of TDX machines, neither CXL
memory nor NVDIMM is TDX convertible memory.  Using the kmem driver to
hot-add any CXL memory or NVDIMM to the core-mm before module
initialization will result in failure to initialize the module.  The
SEAMCALL error code will be available in dmesg to help the user
understand the failure.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v11 -> v12:
 - Added tags from Dave/Kirill.

v10 -> v11:
 - Added Isaku's Reviewed-by.

v9 -> v10:
 - Moved empty @tdx_memlist check out of is_tdx_memory() to make the
   logic better.
 - Added Ying's Reviewed-by.

v8 -> v9:
 - Replace "The initial support ..." with timeless sentence in both
   changelog and comments(Dave).
 - Fix run-on sentence in changelog, and senstence to explain why to
   stash off memblock (Dave).
 - Tried to improve why to choose this approach and how it work in
   changelog based on Dave's suggestion.
 - Many other comments enhancement (Dave).

v7 -> v8:
 - Trimmed down changelog (Dave).
 - Changed to use PHYS_PFN() and PFN_PHYS() throughout this series
   (Ying).
 - Moved memory hotplug handling from add_arch_memory() to
   memory_notifier (Dan/David).
 - Removed 'nid' from 'struct tdx_memblock' to later patch (Dave).
 - {build|free}_tdx_memory() -> {build|}free_tdx_memlist() (Dave).
 - Removed pfn_covered_by_cmr() check as no code to trim CMRs now.
 - Improve the comment around first 1MB (Dave).
 - Added a comment around reserve_real_mode() to point out TDX code
   relies on first 1MB being reserved (Ying).
 - Added comment to explain why the new online memory range cannot
   cross multiple TDX memory blocks (Dave).
 - Improved other comments (Dave).


---
 arch/x86/Kconfig            |   1 +
 arch/x86/kernel/setup.c     |   2 +
 arch/x86/virt/vmx/tdx/tdx.c | 161 +++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |   6 ++
 4 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f0f3f1a2c8e0..2226d8a4c749 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1958,6 +1958,7 @@ config INTEL_TDX_HOST
 	depends on X86_64
 	depends on KVM_INTEL
 	depends on X86_X2APIC
+	select ARCH_KEEP_MEMBLOCK
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 16babff771bd..fd94f8186b9c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1159,6 +1159,8 @@ void __init setup_arch(char **cmdline_p)
 	 *
 	 * Moreover, on machines with SandyBridge graphics or in setups that use
 	 * crashkernel the entire 1M is reserved anyway.
+	 *
+	 * Note that host kernel TDX also requires the first 1MB to be reserved.
 	 */
 	x86_platform.realmode_reserve();
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index a2129cbe056e..127036f06752 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -17,6 +17,13 @@
 #include <linux/spinlock.h>
 #include <linux/percpu-defs.h>
 #include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/memblock.h>
+#include <linux/memory.h>
+#include <linux/minmax.h>
+#include <linux/sizes.h>
+#include <linux/pfn.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/archrandom.h>
@@ -35,6 +42,9 @@ static DEFINE_PER_CPU(bool, tdx_lp_initialized);
 static enum tdx_module_status_t tdx_module_status;
 static DEFINE_MUTEX(tdx_module_lock);
 
+/* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
+static LIST_HEAD(tdx_memlist);
+
 /*
  * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
  * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
@@ -204,6 +214,79 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
 	return 0;
 }
 
+/*
+ * Add a memory region as a TDX memory block.  The caller must make sure
+ * all memory regions are added in address ascending order and don't
+ * overlap.
+ */
+static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
+			    unsigned long end_pfn)
+{
+	struct tdx_memblock *tmb;
+
+	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
+	if (!tmb)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&tmb->list);
+	tmb->start_pfn = start_pfn;
+	tmb->end_pfn = end_pfn;
+
+	/* @tmb_list is protected by mem_hotplug_lock */
+	list_add_tail(&tmb->list, tmb_list);
+	return 0;
+}
+
+static void free_tdx_memlist(struct list_head *tmb_list)
+{
+	/* @tmb_list is protected by mem_hotplug_lock */
+	while (!list_empty(tmb_list)) {
+		struct tdx_memblock *tmb = list_first_entry(tmb_list,
+				struct tdx_memblock, list);
+
+		list_del(&tmb->list);
+		kfree(tmb);
+	}
+}
+
+/*
+ * Ensure that all memblock memory regions are convertible to TDX
+ * memory.  Once this has been established, stash the memblock
+ * ranges off in a secondary structure because memblock is modified
+ * in memory hotplug while TDX memory regions are fixed.
+ */
+static int build_tdx_memlist(struct list_head *tmb_list)
+{
+	unsigned long start_pfn, end_pfn;
+	int i, ret;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+		/*
+		 * The first 1MB is not reported as TDX convertible memory.
+		 * Although the first 1MB is always reserved and won't end up
+		 * in the page allocator, it is still in memblock's memory
+		 * regions.  Skip them manually to exclude them as TDX memory.
+		 */
+		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
+		if (start_pfn >= end_pfn)
+			continue;
+
+		/*
+		 * Add the memory regions as TDX memory.  memblock has
+		 * already guaranteed the regions are in address ascending
+		 * order and don't overlap.
+		 */
+		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	free_tdx_memlist(tmb_list);
+	return ret;
+}
+
 static int init_tdx_module(void)
 {
 	struct tdsysinfo_struct *sysinfo;
@@ -230,10 +313,25 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/*
+	 * To keep things simple, assume that all TDX-protected memory
+	 * will come from the page allocator.  Make sure all pages in the
+	 * page allocator are TDX-usable memory.
+	 *
+	 * Build the list of "TDX-usable" memory regions which cover all
+	 * pages in the page allocator to guarantee that.  Do it while
+	 * holding mem_hotplug_lock read-lock as the memory hotplug code
+	 * path reads the @tdx_memlist to reject any new memory.
+	 */
+	get_online_mems();
+
+	ret = build_tdx_memlist(&tdx_memlist);
+	if (ret)
+		goto out_put_tdxmem;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Build the list of TDX-usable memory regions.
 	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
 	 *    all TDX-usable memory regions.
 	 *  - Configure the TDMRs and the global KeyID to the TDX module.
@@ -243,6 +341,12 @@ static int init_tdx_module(void)
 	 *  Return error before all steps are done.
 	 */
 	ret = -EINVAL;
+out_put_tdxmem:
+	/*
+	 * @tdx_memlist is written here and read at memory hotplug time.
+	 * Lock out memory hotplug code while building it.
+	 */
+	put_online_mems();
 out:
 	/*
 	 * For now both @sysinfo and @cmr_array are only used during
@@ -339,6 +443,54 @@ static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
 	return 0;
 }
 
+static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
+{
+	struct tdx_memblock *tmb;
+
+	/*
+	 * This check assumes that the start_pfn<->end_pfn range does not
+	 * cross multiple @tdx_memlist entries.  A single memory online
+	 * event across multiple memblocks (from which @tdx_memlist
+	 * entries are derived at the time of module initialization) is
+	 * not possible.  This is because memory offline/online is done
+	 * on the granularity of 'struct memory_block', and the hotpluggable
+	 * memory region (one memblock) must be a multiple of memory_block.
+	 */
+	list_for_each_entry(tmb, &tdx_memlist, list) {
+		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
+			return true;
+	}
+	return false;
+}
+
+static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
+			       void *v)
+{
+	struct memory_notify *mn = v;
+
+	if (action != MEM_GOING_ONLINE)
+		return NOTIFY_OK;
+
+	/*
+	 * Empty list means TDX isn't enabled.  Allow any memory
+	 * to go online.
+	 */
+	if (list_empty(&tdx_memlist))
+		return NOTIFY_OK;
+
+	/*
+	 * The TDX memory configuration is static and cannot be
+	 * changed.  Reject onlining any memory which is outside of
+	 * the static configuration whether it supports TDX or not.
+	 */
+	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
+		NOTIFY_OK : NOTIFY_BAD;
+}
+
+static struct notifier_block tdx_memory_nb = {
+	.notifier_call = tdx_memory_notifier,
+};
+
 static int __init tdx_init(void)
 {
 	u32 tdx_keyid_start, nr_tdx_keyids;
@@ -362,6 +514,13 @@ static int __init tdx_init(void)
 		return -ENODEV;
 	}
 
+	err = register_memory_notifier(&tdx_memory_nb);
+	if (err) {
+		pr_info("initialization failed: register_memory_notifier() failed (%d)\n",
+				err);
+		return -ENODEV;
+	}
+
 	/*
 	 * Just use the first TDX KeyID as the 'global KeyID' and
 	 * leave the rest for TDX guests.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 8ab2d40971ea..37ee7c5dce1c 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -94,6 +94,12 @@ enum tdx_module_status_t {
 	TDX_MODULE_ERROR
 };
 
+struct tdx_memblock {
+	struct list_head list;
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+};
+
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	       struct tdx_module_output *out);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 10/22] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (8 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 09/22] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-06-26 14:12 ` [PATCH v12 11/22] x86/virt/tdx: Fill out " Kai Huang
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

After the kernel selects all TDX-usable memory regions, the kernel needs
to pass those regions to the TDX module via data structure "TD Memory
Region" (TDMR).

Add a placeholder to construct a list of TDMRs (in multiple steps) to
cover all TDX-usable memory regions.

=== Long Version ===

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR).  During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.  The list of these
ranges is available to the kernel by querying the TDX module.

The TDX architecture needs additional metadata to record things like
which TD guest "owns" a given page of memory.  This metadata essentially
serves as the 'struct page' for the TDX module.  The space for this
metadata is not reserved by the hardware up front and must be allocated
by the kernel and given to the TDX module.

Since this metadata consumes space, the VMM can choose whether or not to
allocate it for a given area of convertible memory.  If it chooses not
to, the memory cannot receive TDX protections and cannot be used by TDX
guests as private memory.

For every memory region that the VMM wants to use as TDX memory, it sets
up a "TD Memory Region" (TDMR).  Each TDMR represents a physically
contiguous convertible range and must also have its own physically
contiguous metadata table, referred to as a Physical Address Metadata
Table (PAMT), to track status for each page in the TDMR range.

Unlike a CMR, each TDMR requires 1G granularity and alignment.  To
support physical RAM areas that don't meet those strict requirements,
each TDMR permits a number of internal "reserved areas" which can be
placed over memory holes.  If PAMT metadata is placed within a TDMR it
must be covered by one of these reserved areas.

Let's summarize the concepts:

 CMR - Firmware-enumerated physical ranges that support TDX.  CMRs are
       4K aligned.
TDMR - Physical address range which is chosen by the kernel to support
       TDX.  1G granularity and alignment required.  Each TDMR has
       reserved areas where TDX memory holes and overlapping PAMTs can
       be represented.
PAMT - Physically contiguous TDX metadata.  One table for each page size
       per TDMR.  Roughly 1/256th of TDMR in size.  256G TDMR = ~1G
       PAMT.
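
To make the ~1/256 ratio concrete: assuming (hypothetically) a 16-byte
PAMT entry per 4K page, the 4K-page PAMT alone costs 16 / 4096 = 1/256
of the TDMR size, so a 256G TDMR needs 256G / 256 = 1G of PAMT; the 2M
and 1G PAMTs add comparatively little on top.  The actual PAMT entry
size is reported by the TDX module.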

As one step of initializing the TDX module, the kernel configures
TDX-usable memory regions by passing a list of TDMRs to the TDX module.

Constructing the list of TDMRs consists of the below steps:

1) Fill out TDMRs to cover all memory regions that the TDX module will
   use for TD memory.
2) Allocate and set up PAMT for each TDMR.
3) Designate reserved areas for each TDMR.

Add a placeholder to construct TDMRs to do the above steps.  To keep
things simple, just allocate enough space to hold the maximum number of
TDMRs up front.  Always free the buffer of TDMRs since they are only
used during module initialization.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v11 -> v12:
 - Added tags from Dave/Kirill.

v10 -> v11:
 - Changed to keep TDMRs after module initialization to deal with TDX
   erratum in future patches. 

v9 -> v10:
 - Changed the TDMR list from static variable back to local variable as
   now TDX module isn't disabled when tdx_cpu_enable() fails.

v8 -> v9:
 - Changes around 'struct tdmr_info_list' (Dave):
   - Moved the declaration from tdx.c to tdx.h.
   - Renamed 'first_tdmr' to 'tdmrs'.
   - 'nr_tdmrs' -> 'nr_consumed_tdmrs'.
   - Changed 'tdmrs' to 'void *'.
   - Improved comments for all structure members.
 - Added a missing empty line in alloc_tdmr_list() (Dave).

v7 -> v8:
 - Improved changelog to tell this is one step of "TODO list" in
   init_tdx_module().
 - Other changelog improvement suggested by Dave (with "Create TDMRs" to
   "Fill out TDMRs" to align with the code).
 - Added a "TODO list" comment to lay out the steps to construct TDMRs,
   following the same idea of "TODO list" in tdx_module_init().
 - Introduced 'struct tdmr_info_list' (Dave)
 - Further added additional members (tdmr_sz/max_tdmrs/nr_tdmrs) to
   simplify getting TDMR by given index, and reduce passing arguments
   around functions.
 - Added alloc_tdmr_list()/free_tdmr_list() accordingly, which internally
   uses tdmr_size_single() (Dave).
 - tdmr_num -> nr_tdmrs (Dave).

v6 -> v7:
 - Improved commit message to explain 'int' overflow cannot happen
   in cal_tdmr_size() and alloc_tdmr_array(). -- Andy/Dave.

v5 -> v6:
 - construct_tdmrs_memblock() -> construct_tdmrs() as 'tdx_memblock' is
   used instead of memblock.
 - Added Isaku's Reviewed-by.

- v3 -> v5 (no feedback on v4):
 - Moved calculating TDMR size to this patch.
 - Changed to use alloc_pages_exact() to allocate buffer for all TDMRs
   once, instead of allocating each TDMR individually.
 - Removed "crypto protection" in the changelog.
 - -EFAULT -> -EINVAL in couple of places.


---
 arch/x86/virt/vmx/tdx/tdx.c | 97 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h | 32 ++++++++++++
 2 files changed, 127 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 127036f06752..e28615b60f9b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -24,6 +24,7 @@
 #include <linux/minmax.h>
 #include <linux/sizes.h>
 #include <linux/pfn.h>
+#include <linux/align.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/archrandom.h>
@@ -287,9 +288,84 @@ static int build_tdx_memlist(struct list_head *tmb_list)
 	return ret;
 }
 
+/* Calculate the actual TDMR size */
+static int tdmr_size_single(u16 max_reserved_per_tdmr)
+{
+	int tdmr_sz;
+
+	/*
+	 * The actual size of TDMR depends on the maximum
+	 * number of reserved areas.
+	 */
+	tdmr_sz = sizeof(struct tdmr_info);
+	tdmr_sz += sizeof(struct tdmr_reserved_area) * max_reserved_per_tdmr;
+
+	return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
+}
+
+static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list,
+			   struct tdsysinfo_struct *sysinfo)
+{
+	size_t tdmr_sz, tdmr_array_sz;
+	void *tdmr_array;
+
+	tdmr_sz = tdmr_size_single(sysinfo->max_reserved_per_tdmr);
+	tdmr_array_sz = tdmr_sz * sysinfo->max_tdmrs;
+
+	/*
+	 * To keep things simple, allocate all TDMRs together.
+	 * The buffer needs to be physically contiguous to make
+	 * sure each TDMR is physically contiguous.
+	 */
+	tdmr_array = alloc_pages_exact(tdmr_array_sz,
+			GFP_KERNEL | __GFP_ZERO);
+	if (!tdmr_array)
+		return -ENOMEM;
+
+	tdmr_list->tdmrs = tdmr_array;
+
+	/*
+	 * Keep the size of TDMR to find the target TDMR
+	 * at a given index in the TDMR list.
+	 */
+	tdmr_list->tdmr_sz = tdmr_sz;
+	tdmr_list->max_tdmrs = sysinfo->max_tdmrs;
+	tdmr_list->nr_consumed_tdmrs = 0;
+
+	return 0;
+}
+
+static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
+{
+	free_pages_exact(tdmr_list->tdmrs,
+			tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
+}
+
+/*
+ * Construct a list of TDMRs on the preallocated space in @tdmr_list
+ * to cover all TDX memory regions in @tmb_list based on the TDX module
+ * information in @sysinfo.
+ */
+static int construct_tdmrs(struct list_head *tmb_list,
+			   struct tdmr_info_list *tdmr_list,
+			   struct tdsysinfo_struct *sysinfo)
+{
+	/*
+	 * TODO:
+	 *
+	 *  - Fill out TDMRs to cover all TDX memory regions.
+	 *  - Allocate and set up PAMTs for each TDMR.
+	 *  - Designate reserved areas for each TDMR.
+	 *
+	 * Return -EINVAL until constructing TDMRs is done
+	 */
+	return -EINVAL;
+}
+
 static int init_tdx_module(void)
 {
 	struct tdsysinfo_struct *sysinfo;
+	struct tdmr_info_list tdmr_list;
 	struct cmr_info *cmr_array;
 	int ret;
 
@@ -329,11 +405,19 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_put_tdxmem;
 
+	/* Allocate enough space for constructing TDMRs */
+	ret = alloc_tdmr_list(&tdmr_list, sysinfo);
+	if (ret)
+		goto out_free_tdxmem;
+
+	/* Cover all TDX-usable memory regions in TDMRs */
+	ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
+	if (ret)
+		goto out_free_tdmrs;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
-	 *    all TDX-usable memory regions.
 	 *  - Configure the TDMRs and the global KeyID to the TDX module.
 	 *  - Configure the global KeyID on all packages.
 	 *  - Initialize all TDMRs.
@@ -341,6 +425,15 @@ static int init_tdx_module(void)
 	 *  Return error before all steps are done.
 	 */
 	ret = -EINVAL;
+out_free_tdmrs:
+	/*
+	 * Always free the buffer of TDMRs as they are only used during
+	 * module initialization.
+	 */
+	free_tdmr_list(&tdmr_list);
+out_free_tdxmem:
+	if (ret)
+		free_tdx_memlist(&tdx_memlist);
 out_put_tdxmem:
 	/*
 	 * @tdx_memlist is written here and read at memory hotplug time.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 37ee7c5dce1c..193764afc602 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -82,6 +82,29 @@ struct tdsysinfo_struct {
 	DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
 } __packed;
 
+struct tdmr_reserved_area {
+	u64 offset;
+	u64 size;
+} __packed;
+
+#define TDMR_INFO_ALIGNMENT	512
+
+struct tdmr_info {
+	u64 base;
+	u64 size;
+	u64 pamt_1g_base;
+	u64 pamt_1g_size;
+	u64 pamt_2m_base;
+	u64 pamt_2m_size;
+	u64 pamt_4k_base;
+	u64 pamt_4k_size;
+	/*
+	 * Actual number of reserved areas depends on
+	 * 'struct tdsysinfo_struct'::max_reserved_per_tdmr.
+	 */
+	DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas);
+} __packed __aligned(TDMR_INFO_ALIGNMENT);
+
 /*
  * Do not put any hardware-defined TDX structure representations below
  * this comment!
@@ -100,6 +123,15 @@ struct tdx_memblock {
 	unsigned long end_pfn;
 };
 
+struct tdmr_info_list {
+	void *tdmrs;	/* Flexible array to hold 'tdmr_info's */
+	int nr_consumed_tdmrs;	/* How many 'tdmr_info's are in use */
+
+	/* Metadata for finding target 'tdmr_info' and freeing @tdmrs */
+	int tdmr_sz;	/* Size of one 'tdmr_info' */
+	int max_tdmrs;	/* How many 'tdmr_info's are allocated */
+};
+
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	       struct tdx_module_output *out);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 11/22] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (9 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 10/22] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-07-04  7:28   ` Yuan Yao
  2023-06-26 14:12 ` [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
                   ` (11 subsequent siblings)
  22 siblings, 1 reply; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

Start to transition out the "multi-steps" to construct a list of "TD
Memory Regions" (TDMRs) to cover all TDX-usable memory regions.

The kernel configures TDX-usable memory regions by passing a list of
TDMRs "TD Memory Regions" (TDMRs) to the TDX module.  Each TDMR contains
the information of the base/size of a memory region, the base/size of the
associated Physical Address Metadata Table (PAMT) and a list of reserved
areas in the region.

Do the first step to fill out a number of TDMRs to cover all TDX memory
regions.  To keep it simple, always try to use one TDMR for each memory
region.  As the first step, only set up the base/size for each TDMR.

Each TDMR must be 1G aligned and the size must be in 1G granularity.
This implies that one TDMR could cover multiple memory regions.  If a
memory region spans a 1GB boundary and the former part is already
covered by the previous TDMR, just use a new TDMR for the remaining
part.

TDX only supports a limited number of TDMRs.  Disable TDX if all TDMRs
are consumed but there are more memory regions to cover.

There are fancier things that could be done like trying to merge
adjacent TDMRs.  This would allow more pathological memory layouts to be
supported.  But, current systems are not even close to exhausting the
existing TDMR resources in practice.  For now, keep it simple.
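
For illustration, consider a hypothetical layout with two TDX memory
blocks, [0.5G, 1.5G) and [1.8G, 3.5G).  Following the rules above:

	TMB0 [0.5G, 1.5G): 1G-aligned to [0G, 2G) -> TDMR0 = [0G, 2G)
	TMB1 [1.8G, 3.5G): 1G-aligned to [1G, 4G); [1G, 2G) is already
			   covered by TDMR0       -> TDMR1 = [2G, 4G)

Two TDMRs cover both blocks.  The holes inside the TDMRs (e.g.
[1.5G, 1.8G)) are handled later when reserved areas are designated.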

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

v11 -> v12:
 - Improved comments around looping over TDX memblock to create TDMRs.
   (Dave).
 - Added code to pr_warn() when consumed TDMRs reaching maximum TDMRs
   (Dave).
 - BIT_ULL(30) -> SZ_1G (Kirill)
 - Removed unused TDMR_PFN_ALIGNMENT (Sathy)
 - Added tags from Kirill/Sathy

v10 -> v11:
 - No update

v9 -> v10:
 - No change.

v8 -> v9:

 - Added the last paragraph in the changelog (Dave).
 - Removed unnecessary type cast in tdmr_entry() (Dave).

---
 arch/x86/virt/vmx/tdx/tdx.c | 103 +++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |   3 ++
 2 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index e28615b60f9b..2ffc1517a93b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -341,6 +341,102 @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
 			tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
 }
 
+/* Get the TDMR from the list at the given index. */
+static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list,
+				    int idx)
+{
+	int tdmr_info_offset = tdmr_list->tdmr_sz * idx;
+
+	return (void *)tdmr_list->tdmrs + tdmr_info_offset;
+}
+
+#define TDMR_ALIGNMENT		SZ_1G
+#define TDMR_ALIGN_DOWN(_addr)	ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
+#define TDMR_ALIGN_UP(_addr)	ALIGN((_addr), TDMR_ALIGNMENT)
+
+static inline u64 tdmr_end(struct tdmr_info *tdmr)
+{
+	return tdmr->base + tdmr->size;
+}
+
+/*
+ * Take the memory referenced in @tmb_list and populate the
+ * preallocated @tdmr_list, following all the special alignment
+ * and size rules for TDMR.
+ */
+static int fill_out_tdmrs(struct list_head *tmb_list,
+			  struct tdmr_info_list *tdmr_list)
+{
+	struct tdx_memblock *tmb;
+	int tdmr_idx = 0;
+
+	/*
+	 * Loop over TDX memory regions and fill out TDMRs to cover them.
+	 * To keep it simple, always try to use one TDMR to cover one
+	 * memory region.
+	 *
+	 * In practice TDX supports at least 64 TDMRs.  A 2-socket system
+	 * typically only consumes less than 10 of those.  This code is
+	 * dumb and simple and may use more TDMRs than is strictly
+	 * required.
+	 */
+	list_for_each_entry(tmb, tmb_list, list) {
+		struct tdmr_info *tdmr = tdmr_entry(tdmr_list, tdmr_idx);
+		u64 start, end;
+
+		start = TDMR_ALIGN_DOWN(PFN_PHYS(tmb->start_pfn));
+		end   = TDMR_ALIGN_UP(PFN_PHYS(tmb->end_pfn));
+
+		/*
+		 * A valid size indicates the current TDMR has already
+		 * been filled out to cover the previous memory region(s).
+		 */
+		if (tdmr->size) {
+			/*
+			 * Loop to the next if the current memory region
+			 * has already been fully covered.
+			 */
+			if (end <= tdmr_end(tdmr))
+				continue;
+
+			/* Otherwise, skip the already covered part. */
+			if (start < tdmr_end(tdmr))
+				start = tdmr_end(tdmr);
+
+			/*
+			 * Create a new TDMR to cover the current memory
+			 * region, or the remaining part of it.
+			 */
+			tdmr_idx++;
+			if (tdmr_idx >= tdmr_list->max_tdmrs) {
+				pr_warn("initialization failed: TDMRs exhausted.\n");
+				return -ENOSPC;
+			}
+
+			tdmr = tdmr_entry(tdmr_list, tdmr_idx);
+		}
+
+		tdmr->base = start;
+		tdmr->size = end - start;
+	}
+
+	/* @tdmr_idx is always the index of the last valid TDMR. */
+	tdmr_list->nr_consumed_tdmrs = tdmr_idx + 1;
+
+	/*
+	 * Warn early that kernel is about to run out of TDMRs.
+	 *
+	 * This is an indication that TDMR allocation has to be
+	 * reworked to be smarter to not run into an issue.
+	 */
+	if (tdmr_list->max_tdmrs - tdmr_list->nr_consumed_tdmrs < TDMR_NR_WARN)
+		pr_warn("consumed TDMRs reaching limit: %d used out of %d\n",
+				tdmr_list->nr_consumed_tdmrs,
+				tdmr_list->max_tdmrs);
+
+	return 0;
+}
+
 /*
  * Construct a list of TDMRs on the preallocated space in @tdmr_list
  * to cover all TDX memory regions in @tmb_list based on the TDX module
@@ -350,10 +446,15 @@ static int construct_tdmrs(struct list_head *tmb_list,
 			   struct tdmr_info_list *tdmr_list,
 			   struct tdsysinfo_struct *sysinfo)
 {
+	int ret;
+
+	ret = fill_out_tdmrs(tmb_list, tdmr_list);
+	if (ret)
+		return ret;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Fill out TDMRs to cover all TDX memory regions.
 	 *  - Allocate and set up PAMTs for each TDMR.
 	 *  - Designate reserved areas for each TDMR.
 	 *
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 193764afc602..3086f7ad0522 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -123,6 +123,9 @@ struct tdx_memblock {
 	unsigned long end_pfn;
 };
 
+/* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */
+#define TDMR_NR_WARN 4
+
 struct tdmr_info_list {
 	void *tdmrs;	/* Flexible array to hold 'tdmr_info's */
 	int nr_consumed_tdmrs;	/* How many 'tdmr_info's are in use */
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (10 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 11/22] x86/virt/tdx: Fill out " Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-06-27  9:51   ` kirill.shutemov
                     ` (2 more replies)
  2023-06-26 14:12 ` [PATCH v12 13/22] x86/virt/tdx: Designate reserved areas for all TDMRs Kai Huang
                   ` (10 subsequent siblings)
  22 siblings, 3 replies; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

The TDX module uses additional metadata to record things like which
guest "owns" a given page of memory.  This metadata, referred as
Physical Address Metadata Table (PAMT), essentially serves as the
'struct page' for the TDX module.  PAMTs are not reserved by hardware
up front.  They must be allocated by the kernel and then given to the
TDX module during module initialization.

TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
be a physically contiguous area from a Convertible Memory Region (CMR).
However, the PAMTs which track pages in one TDMR do not need to reside
within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
any TDMR, the overlapping part must be reported as a reserved area in
that particular TDMR.

Use alloc_contig_pages() since PAMT must be a physically contiguous area
and it may be potentially large (~1/256th of the size of the given TDMR).
The downside is alloc_contig_pages() may fail at runtime.  One (bad)
mitigation is to launch a TDX guest early during system boot to get
those PAMTs allocated early; the only real fix is to add a boot option
to allocate or reserve PAMTs during kernel boot.

It is imperfect but will be improved on later.

TDX only supports a limited number of reserved areas per TDMR to cover
both PAMTs and memory holes within the given TDMR.  If many PAMTs are
allocated within a single TDMR, the reserved areas may not be sufficient
to cover all of them.

Adopt the following policies when allocating PAMTs for a given TDMR:

  - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
    the total number of reserved areas consumed for PAMTs.
  - Try to first allocate PAMT from the local node of the TDMR for better
    NUMA locality.

Also dump out how much memory is allocated for PAMTs when the TDX module
is initialized successfully.  This helps answer the eternal "where did
all my memory go?" question.
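
As a rough, hypothetical sizing example, assuming a 16-byte PAMT entry
(the actual 'pamt_entry_size' is reported by the TDX module), a 64G
TDMR needs:

	4K PAMT: (64G / 4K) * 16 bytes = 256M
	2M PAMT: (64G / 2M) * 16 bytes = 512K
	1G PAMT: (64G / 1G) * 16 bytes = 1K (rounded up to 4K)

i.e. about 256.5M in total, allocated as one physically contiguous
chunk from the TDMR's local node.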

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---

v11 -> v12:
 - Moved TDX_PS_NUM from tdx.c to <asm/tdx.h> (Kirill)
 - "<= TDX_PS_1G" -> "< TDX_PS_NUM" (Kirill)
 - Changed tdmr_get_pamt() to return base and size instead of base_pfn
   and npages and related code directly (Dave).
 - Simplified PAMT kb counting. (Dave)
 - tdmrs_count_pamt_pages() -> tdmr_count_pamt_kb() (Kirill/Dave)

v10 -> v11:
 - No update

v9 -> v10:
 - Removed code change in disable_tdx_module() as it doesn't exist
   anymore.

v8 -> v9:
 - Added TDX_PS_NR macro instead of open-coding (Dave).
 - Better alignment of 'pamt_entry_size' in tdmr_set_up_pamt() (Dave).
 - Changed to print out PAMTs in "KBs" instead of "pages" (Dave).
 - Added Dave's Reviewed-by.

v7 -> v8: (Dave)
 - Changelog:
  - Added a sentence to state PAMT allocation will be improved.
  - Others suggested by Dave.
 - Moved 'nid' of 'struct tdx_memblock' to this patch.
 - Improved comments around tdmr_get_nid().
 - WARN_ON_ONCE() -> pr_warn() in tdmr_get_nid().
 - Other changes due to 'struct tdmr_info_list'.

v6 -> v7:
 - Changes due to using macros instead of 'enum' for TDX supported page
   sizes.

v5 -> v6:
 - Rebase due to using 'tdx_memblock' instead of memblock.
 - 'int pamt_entry_nr' -> 'unsigned long nr_pamt_entries' (Dave/Sagis).
 - Improved comment around tdmr_get_nid() (Dave).
 - Improved comment in tdmr_set_up_pamt() around breaking the PAMT
   into PAMTs for 4K/2M/1G (Dave).
 - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave).   

- v3 -> v5 (no feedback on v4):
 - Used memblock to get the NUMA node for given TDMR.
 - Removed tdmr_get_pamt_sz() helper but use open-code instead.
 - Changed to use 'switch .. case..' for each TDX supported page size in
   tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()).
 - Added printing out memory used for PAMT allocation when TDX module is
   initialized successfully.
 - Explained downside of alloc_contig_pages() in changelog.
 - Addressed other minor comments.


---
 arch/x86/Kconfig            |   1 +
 arch/x86/include/asm/tdx.h  |   1 +
 arch/x86/virt/vmx/tdx/tdx.c | 215 +++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |   1 +
 4 files changed, 213 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2226d8a4c749..ad364f01de33 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
 	depends on KVM_INTEL
 	depends on X86_X2APIC
 	select ARCH_KEEP_MEMBLOCK
+	depends on CONTIG_ALLOC
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d8226a50c58c..91416fd600cd 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -24,6 +24,7 @@
 #define TDX_PS_4K	0
 #define TDX_PS_2M	1
 #define TDX_PS_1G	2
+#define TDX_PS_NR	(TDX_PS_1G + 1)
 
 /*
  * Used to gather the output registers values of the TDCALL and SEAMCALL
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 2ffc1517a93b..fd5417577f26 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -221,7 +221,7 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
  * overlap.
  */
 static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
-			    unsigned long end_pfn)
+			    unsigned long end_pfn, int nid)
 {
 	struct tdx_memblock *tmb;
 
@@ -232,6 +232,7 @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
 	INIT_LIST_HEAD(&tmb->list);
 	tmb->start_pfn = start_pfn;
 	tmb->end_pfn = end_pfn;
+	tmb->nid = nid;
 
 	/* @tmb_list is protected by mem_hotplug_lock */
 	list_add_tail(&tmb->list, tmb_list);
@@ -259,9 +260,9 @@ static void free_tdx_memlist(struct list_head *tmb_list)
 static int build_tdx_memlist(struct list_head *tmb_list)
 {
 	unsigned long start_pfn, end_pfn;
-	int i, ret;
+	int i, nid, ret;
 
-	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
 		/*
 		 * The first 1MB is not reported as TDX convertible memory.
 		 * Although the first 1MB is always reserved and won't end up
@@ -277,7 +278,7 @@ static int build_tdx_memlist(struct list_head *tmb_list)
 		 * memblock has already guaranteed they are in address
 		 * ascending order and don't overlap.
 		 */
-		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
+		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid);
 		if (ret)
 			goto err;
 	}
@@ -437,6 +438,202 @@ static int fill_out_tdmrs(struct list_head *tmb_list,
 	return 0;
 }
 
+/*
+ * Calculate PAMT size given a TDMR and a page size.  The returned
+ * PAMT size is always aligned up to 4K page boundary.
+ */
+static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz,
+				      u16 pamt_entry_size)
+{
+	unsigned long pamt_sz, nr_pamt_entries;
+
+	switch (pgsz) {
+	case TDX_PS_4K:
+		nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
+		break;
+	case TDX_PS_2M:
+		nr_pamt_entries = tdmr->size >> PMD_SHIFT;
+		break;
+	case TDX_PS_1G:
+		nr_pamt_entries = tdmr->size >> PUD_SHIFT;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return 0;
+	}
+
+	pamt_sz = nr_pamt_entries * pamt_entry_size;
+	/* TDX requires the PAMT size to be 4K aligned */
+	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
+
+	return pamt_sz;
+}
+
+/*
+ * Locate a NUMA node which should hold the allocation of the @tdmr
+ * PAMT.  This node will have some memory covered by the TDMR.  The
+ * relative amount of memory covered is not considered.
+ */
+static int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_list)
+{
+	struct tdx_memblock *tmb;
+
+	/*
+	 * A TDMR must cover at least part of one TMB.  That TMB will end
+	 * after the TDMR begins.  But, that TMB may have started before
+	 * the TDMR.  Find the next 'tmb' that _ends_ after this TDMR
+	 * begins.  Ignore 'tmb' start addresses.  They are irrelevant.
+	 */
+	list_for_each_entry(tmb, tmb_list, list) {
+		if (tmb->end_pfn > PHYS_PFN(tdmr->base))
+			return tmb->nid;
+	}
+
+	/*
+	 * Fall back to allocating the TDMR's metadata from node 0 when
+	 * no TDX memory block can be found.  This should never happen
+	 * since TDMRs originate from TDX memory blocks.
+	 */
+	pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT allocation, fallback to use node 0.\n",
+			tdmr->base, tdmr_end(tdmr));
+	return 0;
+}
+
+/*
+ * Allocate PAMTs from the local NUMA node of some memory in @tmb_list
+ * within @tdmr, and set up PAMTs for @tdmr.
+ */
+static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
+			    struct list_head *tmb_list,
+			    u16 pamt_entry_size)
+{
+	unsigned long pamt_base[TDX_PS_NR];
+	unsigned long pamt_size[TDX_PS_NR];
+	unsigned long tdmr_pamt_base;
+	unsigned long tdmr_pamt_size;
+	struct page *pamt;
+	int pgsz, nid;
+
+	nid = tdmr_get_nid(tdmr, tmb_list);
+
+	/*
+	 * Calculate the PAMT size for each TDX supported page size
+	 * and the total PAMT size.
+	 */
+	tdmr_pamt_size = 0;
+	for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
+		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
+					pamt_entry_size);
+		tdmr_pamt_size += pamt_size[pgsz];
+	}
+
+	/*
+	 * Allocate one chunk of physically contiguous memory for all
+	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
+	 * in overlapped TDMRs.
+	 */
+	pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
+			nid, &node_online_map);
+	if (!pamt)
+		return -ENOMEM;
+
+	/*
+	 * Break the contiguous allocation back up into the
+	 * individual PAMTs for each page size.
+	 */
+	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
+	for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
+		pamt_base[pgsz] = tdmr_pamt_base;
+		tdmr_pamt_base += pamt_size[pgsz];
+	}
+
+	tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
+	tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
+	tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
+	tdmr->pamt_2m_size = pamt_size[TDX_PS_2M];
+	tdmr->pamt_1g_base = pamt_base[TDX_PS_1G];
+	tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
+
+	return 0;
+}
+
+static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_base,
+			  unsigned long *pamt_size)
+{
+	unsigned long pamt_bs, pamt_sz;
+
+	/*
+	 * The PAMT was allocated in one contiguous unit.  The 4K PAMT
+	 * should always point to the beginning of that allocation.
+	 */
+	pamt_bs = tdmr->pamt_4k_base;
+	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
+
+	WARN_ON_ONCE((pamt_bs & ~PAGE_MASK) || (pamt_sz & ~PAGE_MASK));
+
+	*pamt_base = pamt_bs;
+	*pamt_size = pamt_sz;
+}
+
+static void tdmr_free_pamt(struct tdmr_info *tdmr)
+{
+	unsigned long pamt_base, pamt_size;
+
+	tdmr_get_pamt(tdmr, &pamt_base, &pamt_size);
+
+	/* Do nothing if PAMT hasn't been allocated for this TDMR */
+	if (!pamt_size)
+		return;
+
+	if (WARN_ON_ONCE(!pamt_base))
+		return;
+
+	free_contig_range(pamt_base >> PAGE_SHIFT, pamt_size >> PAGE_SHIFT);
+}
+
+static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list)
+{
+	int i;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
+		tdmr_free_pamt(tdmr_entry(tdmr_list, i));
+}
+
+/* Allocate and set up PAMTs for all TDMRs */
+static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
+				 struct list_head *tmb_list,
+				 u16 pamt_entry_size)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list,
+				pamt_entry_size);
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	tdmrs_free_pamt_all(tdmr_list);
+	return ret;
+}
+
+static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list)
+{
+	unsigned long pamt_size = 0;
+	int i;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		unsigned long base, size;
+
+		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
+		pamt_size += size;
+	}
+
+	return pamt_size / 1024;
+}
+
 /*
  * Construct a list of TDMRs on the preallocated space in @tdmr_list
  * to cover all TDX memory regions in @tmb_list based on the TDX module
@@ -452,10 +649,13 @@ static int construct_tdmrs(struct list_head *tmb_list,
 	if (ret)
 		return ret;
 
+	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list,
+			sysinfo->pamt_entry_size);
+	if (ret)
+		return ret;
 	/*
 	 * TODO:
 	 *
-	 *  - Allocate and set up PAMTs for each TDMR.
 	 *  - Designate reserved areas for each TDMR.
 	 *
 	 * Return -EINVAL until constructing TDMRs is done
@@ -526,6 +726,11 @@ static int init_tdx_module(void)
 	 *  Return error before all steps are done.
 	 */
 	ret = -EINVAL;
+	if (ret)
+		tdmrs_free_pamt_all(&tdmr_list);
+	else
+		pr_info("%lu KBs allocated for PAMT.\n",
+				tdmrs_count_pamt_kb(&tdmr_list));
 out_free_tdmrs:
 	/*
 	 * Always free the buffer of TDMRs as they are only used during
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 3086f7ad0522..9b5a65f37e8b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -121,6 +121,7 @@ struct tdx_memblock {
 	struct list_head list;
 	unsigned long start_pfn;
 	unsigned long end_pfn;
+	int nid;
 };
 
 /* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 13/22] x86/virt/tdx: Designate reserved areas for all TDMRs
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (11 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-07-05  5:29   ` Yuan Yao
  2023-06-26 14:12 ` [PATCH v12 14/22] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID Kai Huang
                   ` (9 subsequent siblings)
  22 siblings, 1 reply; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

As the last step of constructing TDMRs, populate reserved areas for all
TDMRs.  For each TDMR, put all memory holes within this TDMR into its
reserved areas.  For all PAMTs which overlap with this TDMR, also put
the overlapping parts into reserved areas.
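
As an illustration (hypothetical layout, not taken from the patch): for
a 1G TDMR [0x40000000, 0x80000000) covering two memory regions
[0x40000000, 0x50000000) and [0x60000000, 0x80000000), with its 4M PAMT
allocated at [0x70000000, 0x70400000), the reserved areas would be the
memory hole [0x50000000, 0x60000000) and the PAMT range [0x70000000,
0x70400000), listed in that (address-ascending) order.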

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v11 -> v12:
 - Code change due to tdmr_get_pamt() change from returning pfn/npages to
   base/size
 - Added Kirill's tag

v10 -> v11:
 - No update

v9 -> v10:
 - No change.

v8 -> v9:
 - Added comment around 'tdmr_add_rsvd_area()' to point out it doesn't do
   optimization to save reserved areas. (Dave).

v7 -> v8: (Dave)
 - "set_up" -> "populate" in function name change (Dave).
 - Improved comment suggested by Dave.
 - Other changes due to 'struct tdmr_info_list'.

v6 -> v7:
 - No change.

v5 -> v6:
 - Rebase due to using 'tdx_memblock' instead of memblock.
 - Split tdmr_set_up_rsvd_areas() into two functions to handle memory
   hole and PAMT respectively.
 - Added Isaku's Reviewed-by.


---
 arch/x86/virt/vmx/tdx/tdx.c | 217 ++++++++++++++++++++++++++++++++++--
 1 file changed, 209 insertions(+), 8 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index fd5417577f26..2bcace5cb25c 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -25,6 +25,7 @@
 #include <linux/sizes.h>
 #include <linux/pfn.h>
 #include <linux/align.h>
+#include <linux/sort.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/archrandom.h>
@@ -634,6 +635,207 @@ static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list)
 	return pamt_size / 1024;
 }
 
+static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
+			      u64 size, u16 max_reserved_per_tdmr)
+{
+	struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
+	int idx = *p_idx;
+
+	/* Reserved area must be 4K aligned in offset and size */
+	if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
+		return -EINVAL;
+
+	if (idx >= max_reserved_per_tdmr) {
+		pr_warn("initialization failed: TDMR [0x%llx, 0x%llx): reserved areas exhausted.\n",
+				tdmr->base, tdmr_end(tdmr));
+		return -ENOSPC;
+	}
+
+	/*
+	 * Consume one reserved area per call.  Make no effort to
+	 * optimize or reduce the number of reserved areas which are
+	 * consumed by contiguous reserved areas, for instance.
+	 */
+	rsvd_areas[idx].offset = addr - tdmr->base;
+	rsvd_areas[idx].size = size;
+
+	*p_idx = idx + 1;
+
+	return 0;
+}
+
+/*
+ * Go through @tmb_list to find holes between memory areas.  If any of
+ * those holes fall within @tdmr, set up a TDMR reserved area to cover
+ * the hole.
+ */
+static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
+				    struct tdmr_info *tdmr,
+				    int *rsvd_idx,
+				    u16 max_reserved_per_tdmr)
+{
+	struct tdx_memblock *tmb;
+	u64 prev_end;
+	int ret;
+
+	/*
+	 * Start looking for reserved blocks at the
+	 * beginning of the TDMR.
+	 */
+	prev_end = tdmr->base;
+	list_for_each_entry(tmb, tmb_list, list) {
+		u64 start, end;
+
+		start = PFN_PHYS(tmb->start_pfn);
+		end   = PFN_PHYS(tmb->end_pfn);
+
+		/* Break if this region is after the TDMR */
+		if (start >= tdmr_end(tdmr))
+			break;
+
+		/* Exclude regions before this TDMR */
+		if (end < tdmr->base)
+			continue;
+
+		/*
+		 * Skip over memory areas that
+		 * have already been dealt with.
+		 */
+		if (start <= prev_end) {
+			prev_end = end;
+			continue;
+		}
+
+		/* Add the hole before this region */
+		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
+				start - prev_end,
+				max_reserved_per_tdmr);
+		if (ret)
+			return ret;
+
+		prev_end = end;
+	}
+
+	/* Add the hole after the last region if it exists. */
+	if (prev_end < tdmr_end(tdmr)) {
+		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
+				tdmr_end(tdmr) - prev_end,
+				max_reserved_per_tdmr);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/*
+ * Go through @tdmr_list to find all PAMTs.  If any of those PAMTs
+ * overlaps with @tdmr, set up a TDMR reserved area to cover the
+ * overlapping part.
+ */
+static int tdmr_populate_rsvd_pamts(struct tdmr_info_list *tdmr_list,
+				    struct tdmr_info *tdmr,
+				    int *rsvd_idx,
+				    u16 max_reserved_per_tdmr)
+{
+	int i, ret;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		struct tdmr_info *tmp = tdmr_entry(tdmr_list, i);
+		unsigned long pamt_base, pamt_size, pamt_end;
+
+		tdmr_get_pamt(tmp, &pamt_base, &pamt_size);
+		/* Each TDMR must already have PAMT allocated */
+		WARN_ON_ONCE(!pamt_size || !pamt_base);
+
+		pamt_end = pamt_base + pamt_size;
+		/* Skip PAMTs outside of the given TDMR */
+		if ((pamt_end <= tdmr->base) ||
+				(pamt_base >= tdmr_end(tdmr)))
+			continue;
+
+		/* Only mark the part within the TDMR as reserved */
+		if (pamt_base < tdmr->base)
+			pamt_base = tdmr->base;
+		if (pamt_end > tdmr_end(tdmr))
+			pamt_end = tdmr_end(tdmr);
+
+		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_base,
+				pamt_end - pamt_base,
+				max_reserved_per_tdmr);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/* Compare function called by sort() for TDMR reserved areas */
+static int rsvd_area_cmp_func(const void *a, const void *b)
+{
+	struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a;
+	struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b;
+
+	if (r1->offset + r1->size <= r2->offset)
+		return -1;
+	if (r1->offset >= r2->offset + r2->size)
+		return 1;
+
+	/* Reserved areas cannot overlap; the caller must guarantee that. */
+	WARN_ON_ONCE(1);
+	return -1;
+}
+
+/*
+ * Populate reserved areas for the given @tdmr, including memory holes
+ * (via @tmb_list) and PAMTs (via @tdmr_list).
+ */
+static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
+				    struct list_head *tmb_list,
+				    struct tdmr_info_list *tdmr_list,
+				    u16 max_reserved_per_tdmr)
+{
+	int ret, rsvd_idx = 0;
+
+	ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx,
+			max_reserved_per_tdmr);
+	if (ret)
+		return ret;
+
+	ret = tdmr_populate_rsvd_pamts(tdmr_list, tdmr, &rsvd_idx,
+			max_reserved_per_tdmr);
+	if (ret)
+		return ret;
+
+	/* TDX requires reserved areas listed in address ascending order */
+	sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area),
+			rsvd_area_cmp_func, NULL);
+
+	return 0;
+}
+
+/*
+ * Populate reserved areas for all TDMRs in @tdmr_list, including memory
+ * holes (via @tmb_list) and PAMTs.
+ */
+static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
+					 struct list_head *tmb_list,
+					 u16 max_reserved_per_tdmr)
+{
+	int i;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		int ret;
+
+		ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i),
+				tmb_list, tdmr_list, max_reserved_per_tdmr);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 /*
  * Construct a list of TDMRs on the preallocated space in @tdmr_list
  * to cover all TDX memory regions in @tmb_list based on the TDX module
@@ -653,14 +855,13 @@ static int construct_tdmrs(struct list_head *tmb_list,
 			sysinfo->pamt_entry_size);
 	if (ret)
 		return ret;
-	/*
-	 * TODO:
-	 *
-	 *  - Designate reserved areas for each TDMR.
-	 *
-	 * Return -EINVAL until constructing TDMRs is done
-	 */
-	return -EINVAL;
+
+	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
+			sysinfo->max_reserved_per_tdmr);
+	if (ret)
+		tdmrs_free_pamt_all(tdmr_list);
+
+	return ret;
 }
 
 static int init_tdx_module(void)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 14/22] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (12 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 13/22] x86/virt/tdx: Designate reserved areas for all TDMRs Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-07-05  6:49   ` Yuan Yao
  2023-06-26 14:12 ` [PATCH v12 15/22] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

The TDX module uses a private KeyID as the "global KeyID" for mapping
things like the PAMT and other TDX metadata.  This KeyID was already
reserved when TDX was detected during early kernel boot.

After the list of "TD Memory Regions" (TDMRs) has been constructed to
cover all TDX-usable memory regions, the next step is to pass them to
the TDX module together with the global KeyID.
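
(Illustrative numbers, not from the patch: with, say, 32 consumed
TDMRs, the array of TDMR physical addresses takes 32 * 8 = 256 bytes;
rounding up to a power of two and then to the array's 512-byte minimum
alignment yields a single 512-byte allocation.  See config_tdx_module()
below.)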

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v11 -> v12:
 - Added Kirill's tag

v10 -> v11:
 - No update

v9 -> v10:
 - Code change due to change static 'tdx_tdmr_list' to local 'tdmr_list'.

v8 -> v9:
 - Improved changelog to explain why initializing TDMRs can take a long
   time (Dave).
 - Improved comments around 'next-to-initialize' address (Dave).

v7 -> v8: (Dave)
 - Changelog:
   - explicitly call out this is the last step of TDX module initialization.
   - Trimmed down changelog by removing SEAMCALL name and details.
 - Removed/trimmed down unnecessary comments.
 - Other changes due to 'struct tdmr_info_list'.

v6 -> v7:
 - Removed need_resched() check. -- Andi.


---
 arch/x86/virt/vmx/tdx/tdx.c | 41 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  2 ++
 2 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 2bcace5cb25c..1992245290de 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -26,6 +26,7 @@
 #include <linux/pfn.h>
 #include <linux/align.h>
 #include <linux/sort.h>
+#include <linux/log2.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/archrandom.h>
@@ -864,6 +865,39 @@ static int construct_tdmrs(struct list_head *tmb_list,
 	return ret;
 }
 
+static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
+{
+	u64 *tdmr_pa_array;
+	size_t array_sz;
+	int i, ret;
+
+	/*
+	 * TDMRs are passed to the TDX module via an array of physical
+	 * addresses of each TDMR.  The array itself also has certain
+	 * alignment requirement.
+	 */
+	array_sz = tdmr_list->nr_consumed_tdmrs * sizeof(u64);
+	array_sz = roundup_pow_of_two(array_sz);
+	if (array_sz < TDMR_INFO_PA_ARRAY_ALIGNMENT)
+		array_sz = TDMR_INFO_PA_ARRAY_ALIGNMENT;
+
+	tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
+	if (!tdmr_pa_array)
+		return -ENOMEM;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
+		tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i));
+
+	ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array),
+				tdmr_list->nr_consumed_tdmrs,
+				global_keyid, 0, NULL, NULL);
+
+	/* Free the array as it is not required anymore. */
+	kfree(tdmr_pa_array);
+
+	return ret;
+}
+
 static int init_tdx_module(void)
 {
 	struct tdsysinfo_struct *sysinfo;
@@ -917,16 +951,21 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_tdmrs;
 
+	/* Pass the TDMRs and the global KeyID to the TDX module */
+	ret = config_tdx_module(&tdmr_list, tdx_global_keyid);
+	if (ret)
+		goto out_free_pamts;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Configure the TDMRs and the global KeyID to the TDX module.
 	 *  - Configure the global KeyID on all packages.
 	 *  - Initialize all TDMRs.
 	 *
 	 *  Return error before all steps are done.
 	 */
 	ret = -EINVAL;
+out_free_pamts:
 	if (ret)
 		tdmrs_free_pamt_all(&tdmr_list);
 	else
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 9b5a65f37e8b..c386aa3afe2a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -24,6 +24,7 @@
 #define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
+#define TDH_SYS_CONFIG		45
 
 struct cmr_info {
 	u64	base;
@@ -88,6 +89,7 @@ struct tdmr_reserved_area {
 } __packed;
 
 #define TDMR_INFO_ALIGNMENT	512
+#define TDMR_INFO_PA_ARRAY_ALIGNMENT	512
 
 struct tdmr_info {
 	u64 base;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 15/22] x86/virt/tdx: Configure global KeyID on all packages
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (13 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 14/22] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-07-05  8:13   ` Yuan Yao
  2023-06-26 14:12 ` [PATCH v12 16/22] x86/virt/tdx: Initialize all TDMRs Kai Huang
                   ` (7 subsequent siblings)
  22 siblings, 1 reply; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

After the list of TDMRs and the global KeyID have been passed to the
TDX module, the kernel needs to configure the key of the global KeyID
on all packages using TDH.SYS.KEY.CONFIG.

This SEAMCALL cannot run in parallel on different cpus.  Loop over all
online cpus and use smp_call_on_cpu() to run this SEAMCALL on the first
cpu of each package.

To keep things simple, this implementation takes no affirmative steps
to online cpus to make sure there's at least one cpu for each package.
The callers (i.e., KVM) can ensure success by making sure sufficient
CPUs are online for this to succeed.

Intel hardware doesn't guarantee cache coherency across different
KeyIDs.  The PAMTs are transitioning from being used by the kernel
mapping (KeyID 0) to the TDX module's "global KeyID" mapping.

This means that the kernel must flush any dirty KeyID-0 PAMT cachelines
before the TDX module uses the global KeyID to access the PAMTs.
Otherwise, if those dirty cachelines were written back, they would
corrupt the TDX module's metadata.  Aside: This corruption would be
detected by the memory integrity hardware on the next read of the memory
with the global KeyID.  The result would likely be fatal to the system
but would not impact TDX security.

Following the TDX module specification, flush cache before configuring
the global KeyID on all packages.  Given the PAMT size can be large
(~1/256th of system RAM), just use WBINVD on all CPUs to flush.
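
(For scale: assuming a 16-byte PAMT entry per 4K page -- the actual
pamt_entry_size is reported by the TDX module -- the 4K-level PAMT
alone consumes 16/4096 = 1/256th of the memory it covers, i.e., roughly
4GB of PAMT for every 1TB of RAM.)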

If TDH.SYS.KEY.CONFIG fails, the TDX module may already have used the
global KeyID to write the PAMTs.  Therefore, use WBINVD to flush the
cache before returning the PAMTs back to the kernel.  Also convert all
PAMTs back to normal by using MOVDIR64B as suggested by the TDX module
spec, although on platforms without the "partial write machine check"
erratum it's OK to leave the PAMTs as-is.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v11 -> v12:
 - Added Kirill's tag
 - Improved changelog (Nikolay)

v10 -> v11:
 - Convert PAMTs back to normal when module initialization fails.
 - Fixed an error in changelog

v9 -> v10:
 - Changed to use 'smp_call_on_cpu()' directly to do key configuration.

v8 -> v9:
 - Improved changelog (Dave).
 - Improved comments to explain the function to configure global KeyID
   "takes no affirmative action to online any cpu". (Dave).
 - Improved other comments suggested by Dave.

v7 -> v8: (Dave)
 - Changelog changes:
  - Point out this is the step of "multi-steps" of init_tdx_module().
  - Removed MOVDIR64B part.
  - Other changes due to removing TDH.SYS.SHUTDOWN and TDH.SYS.LP.INIT.
 - Changed to loop over online cpus and use smp_call_function_single()
   directly as the patch to shut down TDX module has been removed.
 - Removed MOVDIR64B part in comment.

v6 -> v7:
 - Improved changelog and comment to explain why MOVDIR64B isn't used
   when returning PAMTs back to the kernel.


---
 arch/x86/virt/vmx/tdx/tdx.c | 135 +++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |   1 +
 2 files changed, 134 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 1992245290de..f5d4dbc11aee 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -31,6 +31,7 @@
 #include <asm/msr.h>
 #include <asm/archrandom.h>
 #include <asm/page.h>
+#include <asm/special_insns.h>
 #include <asm/tdx.h>
 #include "tdx.h"
 
@@ -577,7 +578,8 @@ static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_base,
 	*pamt_size = pamt_sz;
 }
 
-static void tdmr_free_pamt(struct tdmr_info *tdmr)
+static void tdmr_do_pamt_func(struct tdmr_info *tdmr,
+		void (*pamt_func)(unsigned long base, unsigned long size))
 {
 	unsigned long pamt_base, pamt_size;
 
@@ -590,9 +592,19 @@ static void tdmr_free_pamt(struct tdmr_info *tdmr)
 	if (WARN_ON_ONCE(!pamt_base))
 		return;
 
+	(*pamt_func)(pamt_base, pamt_size);
+}
+
+static void free_pamt(unsigned long pamt_base, unsigned long pamt_size)
+{
 	free_contig_range(pamt_base >> PAGE_SHIFT, pamt_size >> PAGE_SHIFT);
 }
 
+static void tdmr_free_pamt(struct tdmr_info *tdmr)
+{
+	tdmr_do_pamt_func(tdmr, free_pamt);
+}
+
 static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list)
 {
 	int i;
@@ -621,6 +633,41 @@ static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
 	return ret;
 }
 
+/*
+ * Convert TDX private pages back to normal by using MOVDIR64B to
+ * clear these pages.  Note this function doesn't flush cache of
+ * these TDX private pages.  The caller should make sure of that.
+ */
+static void reset_tdx_pages(unsigned long base, unsigned long size)
+{
+	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
+	unsigned long phys, end;
+
+	end = base + size;
+	for (phys = base; phys < end; phys += 64)
+		movdir64b(__va(phys), zero_page);
+
+	/*
+	 * MOVDIR64B uses WC protocol.  Use memory barrier to
+	 * make sure any later user of these pages sees the
+	 * updated data.
+	 */
+	mb();
+}
+
+static void tdmr_reset_pamt(struct tdmr_info *tdmr)
+{
+	tdmr_do_pamt_func(tdmr, reset_tdx_pages);
+}
+
+static void tdmrs_reset_pamt_all(struct tdmr_info_list *tdmr_list)
+{
+	int i;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
+		tdmr_reset_pamt(tdmr_entry(tdmr_list, i));
+}
+
 static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list)
 {
 	unsigned long pamt_size = 0;
@@ -898,6 +945,55 @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
 	return ret;
 }
 
+static int do_global_key_config(void *data)
+{
+	/*
+	 * TDH.SYS.KEY.CONFIG may fail with entropy error (which is a
+	 * recoverable error).  Assume this is exceedingly rare and
+	 * just return error if encountered instead of retrying.
+	 *
+	 * All '0's are just unused parameters.
+	 */
+	return seamcall(TDH_SYS_KEY_CONFIG, 0, 0, 0, 0, NULL, NULL);
+}
+
+/*
+ * Attempt to configure the global KeyID on all physical packages.
+ *
+ * This requires running code on at least one CPU in each package.  If a
+ * package has no online CPUs, that code will not run and TDX module
+ * initialization (TDMR initialization) will fail.
+ *
+ * This code takes no affirmative steps to online CPUs.  Callers
+ * (e.g., KVM) can ensure success by ensuring sufficient CPUs are
+ * online for this to succeed.
+ */
+static int config_global_keyid(void)
+{
+	cpumask_var_t packages;
+	int cpu, ret = -EINVAL;
+
+	if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
+		return -ENOMEM;
+
+	for_each_online_cpu(cpu) {
+		if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
+					packages))
+			continue;
+
+		/*
+		 * TDH.SYS.KEY.CONFIG cannot run concurrently on
+		 * different cpus, so just do it one by one.
+		 */
+		ret = smp_call_on_cpu(cpu, do_global_key_config, NULL, true);
+		if (ret)
+			break;
+	}
+
+	free_cpumask_var(packages);
+	return ret;
+}
+
 static int init_tdx_module(void)
 {
 	struct tdsysinfo_struct *sysinfo;
@@ -956,15 +1052,47 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_pamts;
 
+	/*
+	 * Hardware doesn't guarantee cache coherency across different
+	 * KeyIDs.  The kernel needs to flush PAMT's dirty cachelines
+	 * (associated with KeyID 0) before the TDX module can use the
+	 * global KeyID to access the PAMT.  Given PAMTs are potentially
+	 * large (~1/256th of system RAM), just use WBINVD on all cpus
+	 * to flush the cache.
+	 */
+	wbinvd_on_all_cpus();
+
+	/* Config the key of global KeyID on all packages */
+	ret = config_global_keyid();
+	if (ret)
+		goto out_reset_pamts;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Configure the global KeyID on all packages.
 	 *  - Initialize all TDMRs.
 	 *
 	 *  Return error before all steps are done.
 	 */
 	ret = -EINVAL;
+out_reset_pamts:
+	if (ret) {
+		/*
+		 * Part of PAMTs may already have been initialized by the
+		 * TDX module.  Flush cache before returning PAMTs back
+		 * to the kernel.
+		 */
+		wbinvd_on_all_cpus();
+		/*
+		 * According to the TDX hardware spec, if the platform
+		 * doesn't have the "partial write machine check"
+		 * erratum, any kernel read/write will never cause #MC
+		 * in kernel space, thus it's OK to not convert PAMTs
+		 * back to normal.  But do the conversion anyway here
+		 * as suggested by the TDX spec.
+		 */
+		tdmrs_reset_pamt_all(&tdmr_list);
+	}
 out_free_pamts:
 	if (ret)
 		tdmrs_free_pamt_all(&tdmr_list);
@@ -1019,6 +1147,9 @@ static int __tdx_enable(void)
  * lock to prevent any new cpu from becoming online; 2) done both VMXON
  * and tdx_cpu_enable() on all online cpus.
  *
+ * This function requires there's at least one online cpu for each CPU
+ * package to succeed.
+ *
  * This function can be called in parallel by multiple callers.
  *
  * Return 0 if TDX is enabled successfully, otherwise error.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index c386aa3afe2a..a0438513bec0 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -21,6 +21,7 @@
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_SYS_KEY_CONFIG	31
 #define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 16/22] x86/virt/tdx: Initialize all TDMRs
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (14 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 15/22] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-07-06  5:31   ` Yuan Yao
  2023-06-26 14:12 ` [PATCH v12 17/22] x86/kexec: Flush cache of TDX private memory Kai Huang
                   ` (6 subsequent siblings)
  22 siblings, 1 reply; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

After the global KeyID has been configured on all packages, initialize
all TDMRs to make all TDX-usable memory regions that were passed to the
TDX module actually usable.

This is the last step of initializing the TDX module.

Initializing TDMRs can be time consuming on large memory systems as it
involves initializing all metadata entries for all pages that can be
used by TDX guests.  Initializing different TDMRs can be parallelized.
For now, to keep it simple, just initialize all TDMRs one by one; it
can be enhanced in the future, as sketched below.
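
A rough sketch of what that future enhancement could look like, using a
workqueue to run the per-TDMR initialization concurrently (hypothetical
code, not part of this patch: it reuses init_tdmr()/tdmr_entry()
introduced below and assumes the TDX module accepts concurrent
TDH.SYS.TDMR.INIT on different TDMRs):

	#include <linux/slab.h>
	#include <linux/workqueue.h>

	struct tdmr_init_work {
		struct work_struct work;
		struct tdmr_info *tdmr;
		int ret;
	};

	static void tdmr_init_workfn(struct work_struct *work)
	{
		struct tdmr_init_work *w;

		w = container_of(work, struct tdmr_init_work, work);
		w->ret = init_tdmr(w->tdmr);
	}

	static int init_tdmrs_parallel(struct tdmr_info_list *tdmr_list)
	{
		struct tdmr_init_work *works;
		int i, ret = 0;

		works = kcalloc(tdmr_list->nr_consumed_tdmrs, sizeof(*works),
				GFP_KERNEL);
		if (!works)
			return -ENOMEM;

		/* Kick off one work item per TDMR ... */
		for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
			works[i].tdmr = tdmr_entry(tdmr_list, i);
			INIT_WORK(&works[i].work, tdmr_init_workfn);
			schedule_work(&works[i].work);
		}

		/* ... then wait for all of them and keep the first error. */
		for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
			flush_work(&works[i].work);
			if (works[i].ret && !ret)
				ret = works[i].ret;
		}

		kfree(works);
		return ret;
	}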

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v11 -> v12:
 - Added Kirill's tag

v10 -> v11:
 - No update

v9 -> v10:
 - Code change due to change static 'tdx_tdmr_list' to local 'tdmr_list'.

v8 -> v9:
 - Improved changelog to explain why initializing TDMRs can take a long
   time (Dave).
 - Improved comments around 'next-to-initialize' address (Dave).

v7 -> v8: (Dave)
 - Changelog:
   - explicitly call out this is the last step of TDX module initialization.
   - Trimmed down changelog by removing SEAMCALL name and details.
 - Removed/trimmed down unnecessary comments.
 - Other changes due to 'struct tdmr_info_list'.

v6 -> v7:
 - Removed need_resched() check. -- Andi.


---
 arch/x86/virt/vmx/tdx/tdx.c | 60 ++++++++++++++++++++++++++++++++-----
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index f5d4dbc11aee..52b7267ea226 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -994,6 +994,56 @@ static int config_global_keyid(void)
 	return ret;
 }
 
+static int init_tdmr(struct tdmr_info *tdmr)
+{
+	u64 next;
+
+	/*
+	 * Initializing a TDMR can be time consuming.  To avoid long
+	 * SEAMCALLs, the TDX module may only initialize a part of the
+	 * TDMR in each call.
+	 */
+	do {
+		struct tdx_module_output out;
+		int ret;
+
+		/* All 0's are unused parameters, they mean nothing. */
+		ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
+				&out);
+		if (ret)
+			return ret;
+		/*
+		 * RDX contains 'next-to-initialize' address if
+		 * TDH.SYS.TDMR.INIT did not fully complete and
+		 * should be retried.
+		 */
+		next = out.rdx;
+		cond_resched();
+		/* Keep making SEAMCALLs until the TDMR is done */
+	} while (next < tdmr->base + tdmr->size);
+
+	return 0;
+}
+
+static int init_tdmrs(struct tdmr_info_list *tdmr_list)
+{
+	int i;
+
+	/*
+	 * This operation is costly.  It can be parallelized,
+	 * but keep it simple for now.
+	 */
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		int ret;
+
+		ret = init_tdmr(tdmr_entry(tdmr_list, i));
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 static int init_tdx_module(void)
 {
 	struct tdsysinfo_struct *sysinfo;
@@ -1067,14 +1117,8 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_reset_pamts;
 
-	/*
-	 * TODO:
-	 *
-	 *  - Initialize all TDMRs.
-	 *
-	 *  Return error before all steps are done.
-	 */
-	ret = -EINVAL;
+	/* Initialize TDMRs to complete the TDX module initialization */
+	ret = init_tdmrs(&tdmr_list);
 out_reset_pamts:
 	if (ret) {
 		/*
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index a0438513bec0..f6b4e153890d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -25,6 +25,7 @@
 #define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
+#define TDH_SYS_TDMR_INIT	36
 #define TDH_SYS_CONFIG		45
 
 struct cmr_info {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 17/22] x86/kexec: Flush cache of TDX private memory
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (15 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 16/22] x86/virt/tdx: Initialize all TDMRs Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-06-26 14:12 ` [PATCH v12 18/22] x86/virt/tdx: Keep TDMRs when module initialization is successful Kai Huang
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

There are two problems with using kexec() to boot to a new kernel when
the old kernel has enabled TDX: 1) part of the memory pages are still
TDX private pages; 2) there might be dirty cachelines associated with
TDX private pages.

The first problem doesn't matter on platforms without the "partial write
machine check" erratum.  KeyID 0 doesn't have an integrity check.  If
the new kernel wants to use any non-zero KeyID, it needs to convert the
memory to that KeyID, and such a conversion works from any KeyID.

However the old kernel needs to guarantee there's no dirty cacheline
left behind before booting to the new kernel to avoid silent corruption
from later cacheline writeback (Intel hardware doesn't guarantee cache
coherency across different KeyIDs).

There are two things that the old kernel needs to do to achieve that:

1) Stop accessing TDX private memory mappings:
   a. Stop making TDX module SEAMCALLs (TDX global KeyID);
   b. Stop TDX guests from running (per-guest TDX KeyID).
2) Flush any cachelines from previous TDX private KeyID writes.

For 2), use wbinvd() to flush the cache in stop_this_cpu(), following
SME support.  In this way, 1) happens for free as there's no TDX
activity between wbinvd() and native_halt().

Flushing the cache in stop_this_cpu() only covers the remote cpus.  On
the rebooting cpu which does kexec(), unlike SME which does the cache
flush in relocate_kernel(), flush the cache right after stopping remote
cpus in machine_shutdown().

There are two reasons to do so: 1) For TDX there's no need to defer
cache flush to relocate_kernel() because all TDX activities have been
stopped.  2) On the platforms with the above erratum the kernel must
convert all TDX private pages back to normal before booting to the new
kernel in kexec(), and flushing cache early allows the kernel to convert
memory early rather than having to muck with the relocate_kernel()
assembly.

Theoretically, the cache flush is only needed when the TDX module has
been initialized.  However, initializing the TDX module is done on
demand at runtime, and it takes a mutex to read the module status.
Just check whether TDX is enabled by the BIOS instead, and flush the
cache if so.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v11 -> v12:
 - Changed comment/changelog to say kernel doesn't try to handle fast
   warm reset but depends on BIOS to enable workaround (Kirill)
 - Added Kirill's tag

v10 -> v11:
 - Fixed a bug where the cache for the rebooting cpu wasn't flushed for
   TDX private memory.
 - Updated changelog accordingly.

v9 -> v10:
 - No change.

v8 -> v9:
 - Various changelog enhancement and fix (Dave).
 - Improved comment (Dave).

v7 -> v8:
 - Changelog:
   - Removed "leave TDX module open" part due to shut down patch has been
     removed.

v6 -> v7:
 - Improved changelog to explain why don't convert TDX private pages back
   to normal.


---
 arch/x86/kernel/process.c |  7 ++++++-
 arch/x86/kernel/reboot.c  | 15 +++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index dac41a0072ea..0ce66deb9bc8 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -780,8 +780,13 @@ void __noreturn stop_this_cpu(void *dummy)
 	 *
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
+	 *
+	 * The TDX module or guests might have left dirty cachelines
+	 * behind.  Flush them to avoid corruption from later writeback.
+	 * Note that this flushes on all systems where TDX is possible,
+	 * but does not actually check that TDX was in use.
 	 */
-	if (cpuid_eax(0x8000001f) & BIT(0))
+	if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
 		native_wbinvd();
 	for (;;) {
 		/*
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 3adbe97015c1..ae7480a213a6 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -32,6 +32,7 @@
 #include <asm/realmode.h>
 #include <asm/x86_init.h>
 #include <asm/efi.h>
+#include <asm/tdx.h>
 
 /*
  * Power off function, if any
@@ -695,6 +696,20 @@ void native_machine_shutdown(void)
 	local_irq_disable();
 	stop_other_cpus();
 #endif
+	/*
+	 * stop_other_cpus() has flushed all dirty cachelines of TDX
+	 * private memory on remote cpus.  Unlike SME, which does the
+	 * cache flush on _this_ cpu in the relocate_kernel(), flush
+	 * the cache for _this_ cpu here.  This is because on the
+	 * platforms with "partial write machine check" erratum the
+	 * kernel needs to convert all TDX private pages back to normal
+	 * before booting to the new kernel in kexec(), and the cache
+	 * flush must be done before that.  If the kernel took SME's way,
+	 * it would have to muck with the relocate_kernel() assembly to
+	 * do memory conversion.
+	 */
+	if (platform_tdx_enabled())
+		native_wbinvd();
 
 	lapic_shutdown();
 	restore_boot_irq_mode();
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 18/22] x86/virt/tdx: Keep TDMRs when module initialization is successful
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (16 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 17/22] x86/kexec: Flush cache of TDX private memory Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-06-28  9:04   ` Nikolay Borisov
  2023-06-28 12:23   ` kirill.shutemov
  2023-06-26 14:12 ` [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum Kai Huang
                   ` (4 subsequent siblings)
  22 siblings, 2 replies; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

On platforms with the "partial write machine check" erratum, kexec()
needs to convert all TDX private pages back to normal before booting
to the new kernel.  Otherwise, the new kernel may get an unexpected
machine check.

There's no existing infrastructure to track TDX private pages.  Change
to keep TDMRs when module initialization is successful so that they can
be used to find PAMTs.

With this change, when module initialization succeeds the only cleanup
still needed is put_online_mems() and freeing the buffer holding the
TDSYSINFO_STRUCT and the CMR array.  Adjust the error handling to do
exactly that on success, and to unconditionally clean up everything
else when initialization fails.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v11 -> v12 (new patch):
  - Defer keeping TDMRs logic to this patch for better review
  - Improved error handling logic (Nikolay/Kirill in patch 15)

---
 arch/x86/virt/vmx/tdx/tdx.c | 84 ++++++++++++++++++-------------------
 1 file changed, 42 insertions(+), 42 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 52b7267ea226..85b24b2e9417 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -49,6 +49,8 @@ static DEFINE_MUTEX(tdx_module_lock);
 /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
 static LIST_HEAD(tdx_memlist);
 
+static struct tdmr_info_list tdx_tdmr_list;
+
 /*
  * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
  * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
@@ -1047,7 +1049,6 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
 static int init_tdx_module(void)
 {
 	struct tdsysinfo_struct *sysinfo;
-	struct tdmr_info_list tdmr_list;
 	struct cmr_info *cmr_array;
 	int ret;
 
@@ -1088,17 +1089,17 @@ static int init_tdx_module(void)
 		goto out_put_tdxmem;
 
 	/* Allocate enough space for constructing TDMRs */
-	ret = alloc_tdmr_list(&tdmr_list, sysinfo);
+	ret = alloc_tdmr_list(&tdx_tdmr_list, sysinfo);
 	if (ret)
 		goto out_free_tdxmem;
 
 	/* Cover all TDX-usable memory regions in TDMRs */
-	ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
+	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, sysinfo);
 	if (ret)
 		goto out_free_tdmrs;
 
 	/* Pass the TDMRs and the global KeyID to the TDX module */
-	ret = config_tdx_module(&tdmr_list, tdx_global_keyid);
+	ret = config_tdx_module(&tdx_tdmr_list, tdx_global_keyid);
 	if (ret)
 		goto out_free_pamts;
 
@@ -1118,51 +1119,50 @@ static int init_tdx_module(void)
 		goto out_reset_pamts;
 
 	/* Initialize TDMRs to complete the TDX module initialization */
-	ret = init_tdmrs(&tdmr_list);
+	ret = init_tdmrs(&tdx_tdmr_list);
+	if (ret)
+		goto out_reset_pamts;
+
+	pr_info("%lu KBs allocated for PAMT.\n",
+			tdmrs_count_pamt_kb(&tdx_tdmr_list));
+
+	/*
+	 * @tdx_memlist is written here and read at memory hotplug time.
+	 * Lock out memory hotplug code while building it.
+	 */
+	put_online_mems();
+	/*
+	 * For now both @sysinfo and @cmr_array are only used during
+	 * module initialization, so always free them.
+	 */
+	free_page((unsigned long)sysinfo);
+
+	return 0;
 out_reset_pamts:
-	if (ret) {
-		/*
-		 * Part of PAMTs may already have been initialized by the
-		 * TDX module.  Flush cache before returning PAMTs back
-		 * to the kernel.
-		 */
-		wbinvd_on_all_cpus();
-		/*
-		 * According to the TDX hardware spec, if the platform
-		 * doesn't have the "partial write machine check"
-		 * erratum, any kernel read/write will never cause #MC
-		 * in kernel space, thus it's OK to not convert PAMTs
-		 * back to normal.  But do the conversion anyway here
-		 * as suggested by the TDX spec.
-		 */
-		tdmrs_reset_pamt_all(&tdmr_list);
-	}
+	/*
+	 * Part of PAMTs may already have been initialized by the
+	 * TDX module.  Flush cache before returning PAMTs back
+	 * to the kernel.
+	 */
+	wbinvd_on_all_cpus();
+	/*
+	 * According to the TDX hardware spec, if the platform
+	 * doesn't have the "partial write machine check"
+	 * erratum, any kernel read/write will never cause #MC
+	 * in kernel space, thus it's OK to not convert PAMTs
+	 * back to normal.  But do the conversion anyway here
+	 * as suggested by the TDX spec.
+	 */
+	tdmrs_reset_pamt_all(&tdx_tdmr_list);
 out_free_pamts:
-	if (ret)
-		tdmrs_free_pamt_all(&tdmr_list);
-	else
-		pr_info("%lu KBs allocated for PAMT.\n",
-				tdmrs_count_pamt_kb(&tdmr_list));
+	tdmrs_free_pamt_all(&tdx_tdmr_list);
 out_free_tdmrs:
-	/*
-	 * Always free the buffer of TDMRs as they are only used during
-	 * module initialization.
-	 */
-	free_tdmr_list(&tdmr_list);
+	free_tdmr_list(&tdx_tdmr_list);
 out_free_tdxmem:
-	if (ret)
-		free_tdx_memlist(&tdx_memlist);
+	free_tdx_memlist(&tdx_memlist);
 out_put_tdxmem:
-	/*
-	 * @tdx_memlist is written here and read at memory hotplug time.
-	 * Lock out memory hotplug code while building it.
-	 */
 	put_online_mems();
 out:
-	/*
-	 * For now both @sysinfo and @cmr_array are only used during
-	 * module initialization, so always free them.
-	 */
 	free_page((unsigned long)sysinfo);
 	return ret;
 }
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (17 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 18/22] x86/virt/tdx: Keep TDMRs when module initialization is successful Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-06-28  9:20   ` Nikolay Borisov
                     ` (2 more replies)
  2023-06-26 14:12 ` [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP Kai Huang
                   ` (3 subsequent siblings)
  22 siblings, 3 replies; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

The first few generations of TDX hardware have an erratum.  A partial
write to a TDX private memory cacheline will silently "poison" the
line.  Subsequent reads will consume the poison and generate a machine
check.  According to the TDX hardware spec, neither of these things
should have happened.

== Background ==

Virtually all kernel memory access operations happen in full
cachelines.  In practice, writing a "byte" of memory usually reads a
64-byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.

This problem is triggered by "partial" writes where a write transaction
of less than a cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
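
For illustration only, below is the kind of sub-cacheline store that
can trigger the erratum (a hypothetical snippet, not from the patch;
this is precisely what must not land on TDX private memory on affected
platforms):

	/*
	 * An 8-byte non-temporal store: it bypasses the cache and
	 * arrives at the memory controller as a partial (sub-cacheline)
	 * write transaction.
	 */
	static inline void nt_store_8b(unsigned long *dst, unsigned long val)
	{
		asm volatile("movnti %1, %0" : "=m" (*dst) : "r" (val));
	}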

== Problem ==

A fast warm reset doesn't reset TDX private memory.  Kexec() can also
boot into the new kernel directly.  Thus if the old kernel has enabled
TDX on a platform with this erratum, the new kernel may get an
unexpected machine check.

Note that without this erratum any kernel read/write on TDX private
memory should never cause a machine check, thus it's OK for the old
kernel to leave TDX private pages as-is.

== Solution ==

In short, with this erratum, the kernel needs to explicitly convert all
TDX private pages back to normal to give the new kernel a clean slate
after kexec().  The BIOS is also expected to disable fast warm reset as
a workaround for this erratum, thus this implementation doesn't try to
reset TDX private memory for the reboot case in the kernel but depends
on the BIOS to enable the workaround.

For now, TDX private memory can only be PAMT pages.  It would be ideal
to cover all types of TDX private memory here (TDX guest private pages
and Secure-EPT pages will only exist once TDX support lands in KVM),
but there's no existing infrastructure to track TDX private pages.
It's not feasible to query the TDX module about page type either because
VMX has already been stopped when KVM receives the reboot notifier.

Another option is to blindly convert all memory pages.  But this may
bring non-trivial latency to kexec() on large memory systems (especially
when the number of TDX private pages is small).  Thus even with this
temporary solution, eventually it's better for the kernel to only reset
TDX private pages.  Also, it's problematic to convert all memory pages
because not all pages are mapped as writable in the direct mapping.  The
kernel would need to switch to another page table which maps all pages
as writable (e.g., the identity-mapping table for kexec(), or a new
page table) to do so, but this looks like overkill.

Therefore, rather than doing something dramatic, only reset PAMT pages
for now.  Do it in machine_kexec() to avoid additional overhead to the
machine reboot/shutdown as the kernel depends on the BIOS to disable
fast warm reset as a workaround for the reboot case.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v11 -> v12:
 - Changed comment/changelog to say kernel doesn't try to handle fast
   warm reset but depends on BIOS to enable workaround (Kirill)
 - Added a new tdx_may_has_private_mem to indicate system may have TDX
   private memory and PAMTs/TDMRs are stable to access. (Dave).
 - Use atomic_t for tdx_may_has_private_mem for build-in memory barrier
   (Dave)
 - Changed calling x86_platform.memory_shutdown() to calling
   tdx_reset_memory() directly from machine_kexec() to avoid overhead to
   normal reboot case.

v10 -> v11:
 - New patch


---
 arch/x86/include/asm/tdx.h         |  2 +
 arch/x86/kernel/machine_kexec_64.c |  9 ++++
 arch/x86/virt/vmx/tdx/tdx.c        | 79 ++++++++++++++++++++++++++++++
 3 files changed, 90 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 91416fd600cd..e95c9fbf52e4 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -100,10 +100,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 bool platform_tdx_enabled(void);
 int tdx_cpu_enable(void);
 int tdx_enable(void);
+void tdx_reset_memory(void);
 #else	/* !CONFIG_INTEL_TDX_HOST */
 static inline bool platform_tdx_enabled(void) { return false; }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
 static inline int tdx_enable(void)  { return -ENODEV; }
+static inline void tdx_reset_memory(void) { }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 1a3e2c05a8a5..232253bd7ccd 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -28,6 +28,7 @@
 #include <asm/setup.h>
 #include <asm/set_memory.h>
 #include <asm/cpu.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_ACPI
 /*
@@ -301,6 +302,14 @@ void machine_kexec(struct kimage *image)
 	void *control_page;
 	int save_ftrace_enabled;
 
+	/*
+	 * On platforms with the "partial write machine check" erratum,
+	 * all TDX private pages need to be converted back to normal
+	 * before booting to the new kernel, otherwise the new kernel
+	 * may get an unexpected machine check.
+	 */
+	tdx_reset_memory();
+
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
 		save_processor_state();
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 85b24b2e9417..1107f4227568 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -51,6 +51,8 @@ static LIST_HEAD(tdx_memlist);
 
 static struct tdmr_info_list tdx_tdmr_list;
 
+static atomic_t tdx_may_has_private_mem;
+
 /*
  * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
  * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
@@ -1113,6 +1115,17 @@ static int init_tdx_module(void)
 	 */
 	wbinvd_on_all_cpus();
 
+	/*
+	 * Starting from this point the system may have TDX private
+	 * memory.  Make it globally visible so tdx_reset_memory() only
+	 * reads TDMRs/PAMTs when they are stable.
+	 *
+	 * Note using atomic_inc_return() to provide the explicit memory
+	 * ordering isn't mandatory here as the WBINVD above already
+	 * does that.  Compiler barrier isn't needed here either.
+	 */
+	atomic_inc_return(&tdx_may_has_private_mem);
+
 	/* Config the key of global KeyID on all packages */
 	ret = config_global_keyid();
 	if (ret)
@@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
 	 * as suggested by the TDX spec.
 	 */
 	tdmrs_reset_pamt_all(&tdx_tdmr_list);
+	/*
+	 * No more TDX private pages now, and PAMTs/TDMRs are
+	 * going to be freed.  Make this globally visible so
+	 * tdx_reset_memory() can read stable TDMRs/PAMTs.
+	 *
+	 * Note atomic_dec_return(), which is an atomic RMW with
+	 * return value, always enforces the memory barrier.
+	 */
+	atomic_dec_return(&tdx_may_has_private_mem);
 out_free_pamts:
 	tdmrs_free_pamt_all(&tdx_tdmr_list);
 out_free_tdmrs:
@@ -1229,6 +1251,63 @@ int tdx_enable(void)
 }
 EXPORT_SYMBOL_GPL(tdx_enable);
 
+/*
+ * Convert TDX private pages back to normal on platforms with
+ * "partial write machine check" erratum.
+ *
+ * Called from machine_kexec() before booting to the new kernel.
+ */
+void tdx_reset_memory(void)
+{
+	if (!platform_tdx_enabled())
+		return;
+
+	/*
+	 * Kernel read/write to TDX private memory doesn't
+	 * cause machine check on hardware w/o this erratum.
+	 */
+	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+		return;
+
+	/* Called from kexec() when only rebooting cpu is alive */
+	WARN_ON_ONCE(num_online_cpus() != 1);
+
+	if (!atomic_read(&tdx_may_has_private_mem))
+		return;
+
+	/*
+	 * Ideally it's better to cover all types of TDX private pages,
+	 * but there's no existing infrastructure to tell whether a page
+	 * is TDX private memory or not.  Using SEAMCALL to query TDX
+	 * module isn't feasible either because: 1) VMX has been turned
+	 * off by reaching here so SEAMCALL cannot be made; 2) Even if
+	 * SEAMCALL could be made, the result from the TDX module may not
+	 * be accurate (e.g., a remote CPU can be stopped while the kernel
+	 * is in the middle of reclaiming one TDX private page and doing
+	 * MOVDIR64B).
+	 *
+	 * One solution could be just converting all memory pages, but
+	 * this may bring non-trivial latency on large memory systems
+	 * (especially when the number of TDX private pages is small).
+	 * So even with this temporary solution, eventually the kernel
+	 * should only convert TDX private pages.
+	 *
+	 * Also, not all pages are mapped as writable in direct mapping,
+	 * thus it's problematic to do so.  It can be done by switching
+	 * to the identity mapping table for kexec() or a new page table
+	 * which maps all pages as writable, but the complexity looks
+	 * like overkill.
+	 *
+	 * Thus instead of doing something dramatic to convert all pages,
+	 * only convert PAMTs as for now TDX private pages can only be
+	 * PAMT.
+	 *
+	 * All other cpus are already dead.  TDMRs/PAMTs are stable when
+	 * @tdx_may_has_private_mem reads true.
+	 */
+	tdmrs_reset_pamt_all(&tdx_tdmr_list);
+}
+
 static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
 					    u32 *nr_tdx_keyids)
 {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (18 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-06-28 12:32   ` kirill.shutemov
  2023-06-28 15:29   ` Peter Zijlstra
  2023-06-26 14:12 ` [PATCH v12 21/22] x86/mce: Improve error log of kernel space TDX #MC due to erratum Kai Huang
                   ` (2 subsequent siblings)
  22 siblings, 2 replies; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

On platforms with the "partial write machine check" erratum, a kernel
partial write to TDX private memory may cause an unexpected machine
check.  It would be nice if the #MC handler could print additional
information to show that the #MC was a TDX private memory error due to
a possible kernel bug.

To do that, the machine check handler needs to use SEAMCALL to query
the page type of the error memory from the TDX module, because there's
no existing infrastructure to track TDX private pages.

The SEAMCALL instruction causes #UD if the CPU isn't in VMX operation.
In the #MC handler, it is legal for the CPU not to be in VMX operation
when making this SEAMCALL.  Extend the TDX_MODULE_CALL macro to handle
#UD so the SEAMCALL can return an error code instead of an Oops in the
#MC handler.  Opportunistically handle #GP too since they share the
same code.

A bonus is that when the kernel mistakenly calls SEAMCALL while the CPU
isn't in VMX operation, or when TDX isn't enabled by the BIOS, or when
the BIOS is buggy, the kernel can get a nicer error message rather than
a less understandable Oops.
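
For illustration, with this change a SEAMCALL wrapper can map the two
faults to distinct error codes (a sketch mirroring the seamcall() hunk
below):

	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
	if (sret == TDX_SEAMCALL_UD)	/* CPU not in VMX operation */
		return -EINVAL;
	if (sret == TDX_SEAMCALL_GP)	/* TDX not enabled by BIOS */
		return -ENODEV;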

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v11 -> v12 (new patch):
  - Split out from "SEAMCALL infrastructure" patch for better review.
  - Provide justification in changelog (Dave/David)

---
 arch/x86/include/asm/tdx.h      |  5 +++++
 arch/x86/virt/vmx/tdx/tdx.c     |  7 +++++++
 arch/x86/virt/vmx/tdx/tdxcall.S | 19 +++++++++++++++++--
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e95c9fbf52e4..8d3f85bcccc1 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,6 +8,8 @@
 #include <asm/ptrace.h>
 #include <asm/shared/tdx.h>
 
+#include <asm/trapnr.h>
+
 /*
  * SW-defined error codes.
  *
@@ -18,6 +20,9 @@
 #define TDX_SW_ERROR			(TDX_ERROR | GENMASK_ULL(47, 40))
 #define TDX_SEAMCALL_VMFAILINVALID	(TDX_SW_ERROR | _UL(0xFFFF0000))
 
+#define TDX_SEAMCALL_GP			(TDX_SW_ERROR | X86_TRAP_GP)
+#define TDX_SEAMCALL_UD			(TDX_SW_ERROR | X86_TRAP_UD)
+
 #ifndef __ASSEMBLY__
 
 /* TDX supported page sizes from the TDX module ABI. */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 1107f4227568..eba7ff91206d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -93,6 +93,13 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	case TDX_SEAMCALL_VMFAILINVALID:
 		pr_err_once("module is not loaded.\n");
 		return -ENODEV;
+	case TDX_SEAMCALL_GP:
+		pr_err_once("not enabled by BIOS.\n");
+		return -ENODEV;
+	case TDX_SEAMCALL_UD:
+		pr_err_once("SEAMCALL failed: CPU %d is not in VMX operation.\n",
+				cpu);
+		return -EINVAL;
 	default:
 		pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
 				cpu, fn, sret);
diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
index 49a54356ae99..757b0c34be10 100644
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <asm/asm-offsets.h>
 #include <asm/tdx.h>
+#include <asm/asm.h>
 
 /*
  * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
@@ -45,6 +46,7 @@
 	/* Leave input param 2 in RDX */
 
 	.if \host
+1:
 	seamcall
 	/*
 	 * SEAMCALL instruction is essentially a VMExit from VMX root
@@ -57,10 +59,23 @@
 	 * This value will never be used as actual SEAMCALL error code as
 	 * it is from the Reserved status code class.
 	 */
-	jnc .Lno_vmfailinvalid
+	jnc .Lseamcall_out
 	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
-.Lno_vmfailinvalid:
+	jmp .Lseamcall_out
+2:
+	/*
+	 * SEAMCALL caused #GP or #UD.  When reaching here, %eax contains
+	 * the trap number.  Convert the trap number to the TDX error
+	 * code by setting TDX_SW_ERROR in the high 32 bits of %rax.
+	 *
+	 * Note TDX_SW_ERROR cannot be OR'ed to %rax directly, as the OR
+	 * instruction accepts at most a 32-bit immediate.
+	 */
+	mov $TDX_SW_ERROR, %r12
+	orq %r12, %rax
 
+	_ASM_EXTABLE_FAULT(1b, 2b)
+.Lseamcall_out:
 	.else
 	tdcall
 	.endif
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 21/22] x86/mce: Improve error log of kernel space TDX #MC due to erratum
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (19 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-06-28 12:38   ` kirill.shutemov
  2023-07-07  7:26   ` Yuan Yao
  2023-06-26 14:12 ` [PATCH v12 22/22] Documentation/x86: Add documentation for TDX host support Kai Huang
  2023-06-28  7:04 ` [PATCH v12 00/22] TDX host kernel support Yuan Yao
  22 siblings, 2 replies; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

The first few generations of TDX hardware have an erratum.  Triggering
it in Linux requires some kind of kernel bug involving relatively exotic
memory writes to TDX private memory, and it manifests via
spurious-looking machine checks when reading the affected memory.

== Background ==

Virtually all kernel memory access operations happen in full
cachelines.  In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.

This problem is triggered by "partial" writes, where a write transaction
of less than a cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
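
For illustration, a non-temporal store like the sketch below is enough
to produce such a poisoning partial write (hypothetical buggy code, not
from this series):

	/* BUG: 'dst' accidentally points into TDX private memory */
	static void buggy_partial_write(unsigned long *dst, unsigned long val)
	{
		/* MOVNTI: an 8-byte non-temporal (partial) write */
		asm volatile("movnti %1, %0" : "=m" (*dst) : "r" (val));
	}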

== Problem ==

A partial write to a TDX private memory cacheline will silently "poison"
the line.  Subsequent reads will consume the poison and generate a
machine check.  According to the TDX hardware spec, neither of these
things should have happened.

To add insult to injury, the Linux machine check code will present these
as a literal "Hardware error" when they were, in fact, a
software-triggered issue.

== Solution ==

In the end, this issue is hard to trigger.  Rather than do something
rash (and incomplete) like unmap TDX private memory from the direct map,
improve the machine check handler.

Currently, the #MC handler doesn't distinguish whether the memory is
TDX private memory or not, but just dumps, for instance, the message
below:

 [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
 [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
 	...
 [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
 [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
 [...] Kernel panic - not syncing: Fatal local machine check

Which says "Hardware Error" and "Data load in unrecoverable area of
kernel".

Ideally, it's better for the log to say "software bug around TDX private
memory" instead of "Hardware Error".  But in reality the real hardware
memory error can happen, and sadly such software-triggered #MC cannot be
distinguished from the real hardware error.  Also, the error message is
used by userspace tool 'mcelog' to parse, so changing the output may
break userspace.

So keep the "Hardware Error".  The "Data load in unrecoverable area of
kernel" is also helpful, so keep it too.

Instead of modifying the above error log, improve it by printing an
additional TDX-related message to make the log look like:

  ...
 [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
 [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.

Adding this additional message requires determination of whether the
memory page is TDX private memory.  There is no existing infrastructure
to do that.  Add an interface to query the TDX module to fill this gap.

== Impact ==

This issue requires some kind of kernel bug to trigger.

TDX private memory should never be mapped UC/WC.  A partial write
originating from these mappings would require *two* bugs, first mapping
the wrong page, then writing the wrong memory.  It would also be
detectable using traditional memory corruption techniques like
DEBUG_PAGEALLOC.

MOVNTI (and friends) could cause this issue with something like a simple
buffer overrun or use-after-free on the direct map.  It should also be
detectable with normal debug techniques.

The one place where this might get nasty would be if the CPU read data
then wrote back the same data.  That would trigger this problem but
would not, for instance, set off mechanisms like slab redzoning because
it doesn't actually corrupt data.

With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
TDX private memory would first need to be incorrectly mapped into the
I/O space and then a later DMA to that mapping would actually cause the
poisoning event.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v11 -> v12:
 - Simplified #MC message (Dave/Kirill)
 - Slightly improved some comments.

v10 -> v11:
 - New patch


---
 arch/x86/include/asm/tdx.h     |   2 +
 arch/x86/kernel/cpu/mce/core.c |  33 +++++++++++
 arch/x86/virt/vmx/tdx/tdx.c    | 102 +++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h    |   5 ++
 4 files changed, 142 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8d3f85bcccc1..a697b359d8c6 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -106,11 +106,13 @@ bool platform_tdx_enabled(void);
 int tdx_cpu_enable(void);
 int tdx_enable(void);
 void tdx_reset_memory(void);
+bool tdx_is_private_mem(unsigned long phys);
 #else	/* !CONFIG_INTEL_TDX_HOST */
 static inline bool platform_tdx_enabled(void) { return false; }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
 static inline int tdx_enable(void)  { return -ENODEV; }
 static inline void tdx_reset_memory(void) { }
+static inline bool tdx_is_private_mem(unsigned long phys) { return false; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2eec60f50057..f71b649f4c82 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -52,6 +52,7 @@
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/reboot.h>
+#include <asm/tdx.h>
 
 #include "internal.h"
 
@@ -228,11 +229,34 @@ static void wait_for_panic(void)
 	panic("Panicing machine check CPU died");
 }
 
+static const char *mce_memory_info(struct mce *m)
+{
+	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
+		return NULL;
+
+	/*
+	 * Certain initial generations of TDX-capable CPUs have an
+	 * erratum.  A kernel non-temporal partial write to TDX private
+	 * memory poisons that memory, and a subsequent read of that
+	 * memory triggers #MC.
+	 *
+	 * However, such a software-caused #MC cannot be distinguished
+	 * from a real hardware #MC.  Just print an additional message
+	 * to note that such a #MC may be a result of the CPU erratum.
+	 */
+	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+		return NULL;
+
+	return !tdx_is_private_mem(m->addr) ? NULL :
+		"TDX private memory error. Possible kernel bug.";
+}
+
 static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 {
 	struct llist_node *pending;
 	struct mce_evt_llist *l;
 	int apei_err = 0;
+	const char *memmsg;
 
 	/*
 	 * Allow instrumentation around external facilities usage. Not that it
@@ -283,6 +307,15 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 	}
 	if (exp)
 		pr_emerg(HW_ERR "Machine check: %s\n", exp);
+	/*
+	 * On confidential computing platforms such as TDX, an MCE
+	 * can occur due to incorrect access to confidential memory.
+	 * Print additional information for such errors.
+	 */
+	memmsg = mce_memory_info(final);
+	if (memmsg)
+		pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
+
 	if (!fake_panic) {
 		if (panic_timeout == 0)
 			panic_timeout = mca_cfg.panic_timeout;
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index eba7ff91206d..5f96c2d866e5 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1315,6 +1315,108 @@ void tdx_reset_memory(void)
 	tdmrs_reset_pamt_all(&tdx_tdmr_list);
 }
 
+static bool is_pamt_page(unsigned long phys)
+{
+	struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
+	int i;
+
+	/*
+	 * This function is called from the #MC handler, and theoretically
+	 * it could run in parallel with the TDX module initialization
+	 * on other logical cpus.  But it's not OK to hold a mutex here,
+	 * so just blindly check the module status to make sure the
+	 * PAMTs/TDMRs are stable to access.
+	 *
+	 * This may return an inaccurate result in rare cases, e.g., when
+	 * a #MC happens on a PAMT page during module initialization, but
+	 * this is fine as the #MC handler doesn't need a 100% accurate
+	 * result.
+	 */
+	if (tdx_module_status != TDX_MODULE_INITIALIZED)
+		return false;
+
+	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
+		unsigned long base, size;
+
+		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
+
+		if (phys >= base && phys < (base + size))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Return whether the memory page at the given physical address is TDX
+ * private memory or not.  Called from #MC handler do_machine_check().
+ *
+ * Note this function may not return an accurate result in rare cases.
+ * This is fine as the #MC handler doesn't need a 100% accurate result,
+ * because it cannot distinguish whether a #MC is caused by a software
+ * bug or a real hardware error anyway.
+ */
+bool tdx_is_private_mem(unsigned long phys)
+{
+	struct tdx_module_output out;
+	u64 sret;
+
+	if (!platform_tdx_enabled())
+		return false;
+
+	/* Get page type from the TDX module */
+	sret = __seamcall(TDH_PHYMEM_PAGE_RDMD, phys & PAGE_MASK,
+			0, 0, 0, &out);
+	/*
+	 * Handle the case that CPU isn't in VMX operation.
+	 *
+	 * KVM guarantees no VM is running (thus no TDX guest)
+	 * when any online CPU isn't in VMX operation.
+	 * This means there will be no TDX guest private memory
+	 * and Secure-EPT pages.  However the TDX module may have
+	 * been initialized and the memory page could be PAMT.
+	 */
+	if (sret == TDX_SEAMCALL_UD)
+		return is_pamt_page(phys);
+
+	/*
+	 * Any other failure means:
+	 *
+	 * 1) TDX module not loaded; or
+	 * 2) Memory page isn't managed by the TDX module.
+	 *
+	 * In either case, the memory page cannot be a TDX
+	 * private page.
+	 */
+	if (sret)
+		return false;
+
+	/*
+	 * SEAMCALL was successful -- read page type (via RCX):
+	 *
+	 *  - PT_NDA:	Page is not used by the TDX module
+	 *  - PT_RSVD:	Reserved for Non-TDX use
+	 *  - Others:	Page is used by the TDX module
+	 *
+	 * Note PAMT pages are marked as PT_RSVD but they are also TDX
+	 * private memory.
+	 *
+	 * Note: Even if the page type is PT_NDA, the memory page could
+	 * still be associated with a TDX private KeyID if the kernel
+	 * hasn't explicitly used MOVDIR64B to clear the page.  Assume
+	 * KVM always does that after reclaiming any private page from
+	 * TDX guests.
+	 */
+	switch (out.rcx) {
+	case PT_NDA:
+		return false;
+	case PT_RSVD:
+		return is_pamt_page(phys);
+	default:
+		return true;
+	}
+}
+
 static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
 					    u32 *nr_tdx_keyids)
 {
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index f6b4e153890d..2fefd688924c 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -21,6 +21,7 @@
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_PHYMEM_PAGE_RDMD	24
 #define TDH_SYS_KEY_CONFIG	31
 #define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
@@ -28,6 +29,10 @@
 #define TDH_SYS_TDMR_INIT	36
 #define TDH_SYS_CONFIG		45
 
+/* TDX page types */
+#define	PT_NDA		0x0
+#define	PT_RSVD		0x1
+
 struct cmr_info {
 	u64	base;
 	u64	size;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* [PATCH v12 22/22] Documentation/x86: Add documentation for TDX host support
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (20 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 21/22] x86/mce: Improve error log of kernel space TDX #MC due to erratum Kai Huang
@ 2023-06-26 14:12 ` Kai Huang
  2023-06-28  7:04 ` [PATCH v12 00/22] TDX host kernel support Yuan Yao
  22 siblings, 0 replies; 159+ messages in thread
From: Kai Huang @ 2023-06-26 14:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo, kai.huang

Add documentation for TDX host kernel support.  There is already one
file Documentation/x86/tdx.rst containing documentation for TDX guest
internals; reuse it for TDX host kernel support too.

Introduce a new menu level "TDX Guest Support", move the existing
materials under it, and add a new menu for TDX host kernel support.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v11 -> v12:
 - Removed "no CPUID/MSR to detect TDX module" related (Dave).
 - Fixed some spelling errors.

---
 Documentation/arch/x86/tdx.rst | 189 +++++++++++++++++++++++++++++++--
 1 file changed, 178 insertions(+), 11 deletions(-)

diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
index dc8d9fd2c3f7..f8017a2a663e 100644
--- a/Documentation/arch/x86/tdx.rst
+++ b/Documentation/arch/x86/tdx.rst
@@ -10,6 +10,173 @@ encrypting the guest memory. In TDX, a special module running in a special
 mode sits between the host and the guest and manages the guest/host
 separation.
 
+TDX Host Kernel Support
+=======================
+
+TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
+a new isolated range pointed to by the SEAM Range Register (SEAMRR).  A
+CPU-attested software module called 'the TDX module' runs inside the new
+isolated range to provide the functionality to manage and run protected
+VMs.
+
+TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
+provide crypto-protection to the VMs.  TDX reserves part of MKTME KeyIDs
+as TDX private KeyIDs, which are only accessible within the SEAM mode.
+BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs.
+
+Before the TDX module can be used to create and run protected VMs, it
+must be loaded into the isolated range and properly initialized.  The TDX
+architecture doesn't require the BIOS to load the TDX module, but the
+kernel assumes it is loaded by the BIOS.
+
+TDX boot-time detection
+-----------------------
+
+The kernel detects TDX by detecting TDX private KeyIDs during kernel
+boot.  The dmesg below shows when TDX is enabled by the BIOS::
+
+  [..] tdx: BIOS enabled: private KeyID range: [16, 64).
+
+TDX module initialization
+-------------------------
+
+The kernel talks to the TDX module via the new SEAMCALL instruction.  The
+TDX module implements SEAMCALL leaf functions to allow the kernel to
+initialize it.
+
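+For reference, the low-level wrapper this series uses to make SEAMCALLs
+has the following prototype (a simplified view of the host TDX code)::
+
+        u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+                       struct tdx_module_output *out);
+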
+If the TDX module isn't loaded, the SEAMCALL instruction fails with a
+special error.  In this case the kernel fails the module initialization
+and reports the module isn't loaded::
+
+  [..] tdx: Module isn't loaded.
+
+Initializing the TDX module consumes roughly ~1/256th of system RAM to
+use as 'metadata' for the TDX memory.  It also takes additional CPU
+time to initialize that metadata along with the TDX module itself.
+Neither is trivial.  The kernel therefore initializes the TDX module
+at runtime on demand.
+
+Besides initializing the TDX module, a per-cpu initialization SEAMCALL
+must be done on a given cpu before any other SEAMCALLs can be made on
+that cpu.
+
+The kernel provides two functions, tdx_enable() and tdx_cpu_enable(), to
+allow the user of TDX to enable the TDX module and to enable TDX on a
+local cpu respectively.
+
+Making a SEAMCALL requires the CPU to already be in VMX operation (i.e.
+VMXON has been done).  For now, neither tdx_enable() nor tdx_cpu_enable()
+handles VMXON internally; both depend on the caller to guarantee that.
+
+To enable TDX, the caller of TDX should: 1) hold the read lock of the CPU
+hotplug lock; 2) do VMXON and tdx_cpu_enable() successfully on all online
+cpus; 3) call tdx_enable().  For example::
+
+        cpus_read_lock();
+        on_each_cpu(vmxon_and_tdx_cpu_enable, NULL, 1);
+        ret = tdx_enable();
+        cpus_read_unlock();
+        if (ret)
+                goto no_tdx;
+        // TDX is ready to use
+
+And the caller of TDX must guarantee that tdx_cpu_enable() has been done
+successfully on a cpu before running any other SEAMCALL on that cpu.
+A typical usage is to do both VMXON and tdx_cpu_enable() in the CPU
+hotplug online callback, and to refuse to online the cpu if
+tdx_cpu_enable() fails, as sketched below.
+
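+A minimal sketch of such a callback (the names vmxon(), vmxoff() and
+tdx_online_cpu() are illustrative, not existing kernel functions)::
+
+        static int tdx_online_cpu(unsigned int cpu)
+        {
+                int ret;
+
+                ret = vmxon();          /* hypothetical VMXON helper */
+                if (ret)
+                        return ret;
+
+                ret = tdx_cpu_enable();
+                if (ret)
+                        vmxoff();       /* undo VMXON on failure */
+
+                return ret;             /* non-zero refuses onlining */
+        }
+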
+Users can consult dmesg to see the presence of the TDX module, and
+whether it has been initialized.
+
+If the TDX module is not loaded, dmesg shows the following::
+
+  [..] tdx: TDX module is not loaded.
+
+If the TDX module is initialized successfully, dmesg shows something
+like the following::
+
+  [..] tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+  [..] tdx: 262668 KBs allocated for PAMT.
+  [..] tdx: TDX module initialized.
+
+If the TDX module failed to initialize, dmesg shows a message that it
+failed to initialize::
+
+  [..] tdx: TDX module initialization failed ...
+
+TDX Interaction with Other Kernel Components
+---------------------------------------------
+
+TDX Memory Policy
+~~~~~~~~~~~~~~~~~
+
+TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
+kernel which memory is TDX compatible.  The kernel needs to build a list
+of memory regions (out of CMRs) as "TDX-usable" memory and pass those
+regions to the TDX module.  Once this is done, those "TDX-usable" memory
+regions are fixed during the module's lifetime.
+
+To keep things simple, currently the kernel simply guarantees all pages
+in the page allocator are TDX memory.  Specifically, the kernel uses all
+system memory in the core-mm at the time of initializing the TDX module
+as TDX memory, and in the meantime, refuses to online any non-TDX memory
+via memory hotplug.
+
+This can be enhanced in the future, e.g. by allowing adding non-TDX
+memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
+and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
+needs to guarantee memory pages for TDX guests are always allocated from
+the "TDX-capable" nodes.
+
+Physical Memory Hotplug
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Note TDX assumes convertible memory is always physically present during
+the machine's runtime.  A non-buggy BIOS should never support hot-removal
+of any convertible memory.  This implementation doesn't handle ACPI
+memory removal but depends on the BIOS to behave correctly.
+
+CPU Hotplug
+~~~~~~~~~~~
+
+The TDX module requires that the per-cpu initialization SEAMCALL
+(TDH.SYS.LP.INIT) be done on a cpu before any other SEAMCALLs can be made
+on that cpu, including those involved in the module initialization.
+
+The kernel provides tdx_cpu_enable() to let the user of TDX do it when
+wanting to use a new cpu for TDX tasks.
+
+TDX doesn't support physical (ACPI) CPU hotplug.  During machine boot,
+TDX verifies all boot-time present logical CPUs are TDX compatible before
+enabling TDX.  A non-buggy BIOS should never support hot-add/removal of
+physical CPUs.  Currently the kernel doesn't handle physical CPU hotplug,
+but depends on the BIOS to behave correctly.
+
+Note TDX works with CPU logical online/offline, thus the kernel still
+allows offlining a logical CPU and onlining it again.
+
+Kexec()
+~~~~~~~
+
+There are two problems in terms of using kexec() to boot to a new kernel
+when the old kernel has enabled TDX: 1) Part of the memory pages are
+still TDX private pages; 2) There might be dirty cachelines associated
+with TDX private pages.
+
+The first problem doesn't matter.  KeyID 0 doesn't have an integrity
+check.  Even if the new kernel wants to use a non-zero KeyID, it needs
+to convert the memory to that KeyID first, and such conversion works
+from any previous KeyID.
+
+However the old kernel needs to guarantee there's no dirty cacheline
+left behind before booting to the new kernel to avoid silent corruption
+from later cacheline writeback (Intel hardware doesn't guarantee cache
+coherency across different KeyIDs).
+
+Similar to AMD SME, the kernel just uses wbinvd() to flush cache before
+booting to the new kernel.
+
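+A sketch of the resulting kexec-time sequence (the exact call sites and
+relative ordering are illustrative; tdx_reset_memory() only acts on
+erratum-affected platforms)::
+
+        /* convert TDX private pages (PAMTs) back to normal */
+        tdx_reset_memory();
+        /* flush dirty cachelines before jumping to the new kernel */
+        native_wbinvd();
+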
+TDX Guest Support
+=================
 Since the host cannot directly access guest registers or memory, much
 normal functionality of a hypervisor must be moved into the guest. This is
 implemented using a Virtualization Exception (#VE) that is handled by the
@@ -20,7 +187,7 @@ TDX includes new hypercall-like mechanisms for communicating from the
 guest to the hypervisor or the TDX module.
 
 New TDX Exceptions
-==================
+------------------
 
 TDX guests behave differently from bare-metal and traditional VMX guests.
 In TDX guests, otherwise normal instructions or memory accesses can cause
@@ -30,7 +197,7 @@ Instructions marked with an '*' conditionally cause exceptions.  The
 details for these instructions are discussed below.
 
 Instruction-based #VE
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - Port I/O (INS, OUTS, IN, OUT)
 - HLT
@@ -41,7 +208,7 @@ Instruction-based #VE
 - CPUID*
 
 Instruction-based #GP
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
   VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
@@ -52,7 +219,7 @@ Instruction-based #GP
 - RDMSR*,WRMSR*
 
 RDMSR/WRMSR Behavior
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 MSR access behavior falls into three categories:
 
@@ -73,7 +240,7 @@ trapping and handling in the TDX module.  Other than possibly being slow,
 these MSRs appear to function just as they would on bare metal.
 
 CPUID Behavior
---------------
+~~~~~~~~~~~~~~
 
 For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
 return values (in guest EAX/EBX/ECX/EDX) are configurable by the
@@ -93,7 +260,7 @@ not know how to handle. The guest kernel may ask the hypervisor for the
 value with a hypercall.
 
 #VE on Memory Accesses
-======================
+----------------------
 
 There are essentially two classes of TDX memory: private and shared.
 Private memory receives full TDX protections.  Its content is protected
@@ -107,7 +274,7 @@ entries.  This helps ensure that a guest does not place sensitive
 information in shared memory, exposing it to the untrusted hypervisor.
 
 #VE on Shared Memory
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 Access to shared mappings can cause a #VE.  The hypervisor ultimately
 controls whether a shared memory access causes a #VE, so the guest must be
@@ -127,7 +294,7 @@ be careful not to access device MMIO regions unless it is also prepared to
 handle a #VE.
 
 #VE on Private Pages
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 An access to private mappings can also cause a #VE.  Since all kernel
 memory is also private memory, the kernel might theoretically need to
@@ -145,7 +312,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a
 to handle the exception.
 
 Linux #VE handler
-=================
+-----------------
 
 Just like page faults or #GP's, #VE exceptions can be either handled or be
 fatal.  Typically, an unhandled userspace #VE results in a SIGSEGV.
@@ -167,7 +334,7 @@ While the block is in place, any #VE is elevated to a double fault (#DF)
 which is not recoverable.
 
 MMIO handling
-=============
+-------------
 
 In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
 mapping which will cause a VMEXIT on access, and then the hypervisor
@@ -189,7 +356,7 @@ MMIO access via other means (like structure overlays) may result in an
 oops.
 
 Shared Memory Conversions
-=========================
+-------------------------
 
 All TDX guest memory starts out as private at boot.  This memory can not
 be accessed by the hypervisor.  However, some kernel users like device
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-26 14:12 ` [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand Kai Huang
@ 2023-06-26 21:21   ` Sathyanarayanan Kuppuswamy
  2023-06-27 10:37     ` Huang, Kai
  2023-06-27  9:50   ` kirill.shutemov
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 159+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2023-06-26 21:21 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, nik.borisov, bagasdotme,
	sagis, imammedo



On 6/26/23 7:12 AM, Kai Huang wrote:
> To enable TDX the kernel needs to initialize TDX from two perspectives:
> 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready
> to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL
> on one logical cpu before the kernel wants to make any other SEAMCALLs
> on that cpu (including those involved during module initialization and
> running TDX guests).
> 
> The TDX module can be initialized only once in its lifetime.  Instead
> of always initializing it at boot time, this implementation chooses an
> "on demand" approach that defers initializing TDX until there is a real
> need (e.g. when requested by KVM).  This approach has the below pros:
> 
> 1) It avoids consuming the memory that must be allocated by kernel and
> given to the TDX module as metadata (~1/256th of the TDX-usable memory),
> and also saves the CPU cycles of initializing the TDX module (and the
> metadata) when TDX is not used at all.
> 
> 2) The TDX module design allows it to be updated while the system is
> running.  The update procedure shares quite a few steps with this "on
> demand" initialization mechanism.  The hope is that much of "on demand"
> mechanism can be shared with a future "update" mechanism.  A boot-time
> TDX module implementation would not be able to share much code with the
> update mechanism.
> 
> 3) Making SEAMCALL requires VMX to be enabled.  Currently, only the KVM
> code mucks with VMX enabling.  If the TDX module were to be initialized
> separately from KVM (like at boot), the boot code would need to be
> taught how to muck with VMX enabling and KVM would need to be taught how
> to cope with that.  Making KVM itself responsible for TDX initialization
> lets the rest of the kernel stay blissfully unaware of VMX.
> 
> Similar to module initialization, also make the per-cpu initialization
> "on demand" as it also depends on VMX being enabled.
> 
> Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX
> module and enable TDX on local cpu respectively.  For now tdx_enable()
> is a placeholder.  The TODO list will be pared down as functionality is
> added.
> 
> Export both tdx_cpu_enable() and tdx_enable() for KVM use.
> 
> In tdx_enable() use a state machine protected by mutex to make sure the
> initialization will only be done once, as tdx_enable() can be called
> multiple times (i.e. KVM module can be reloaded) and may be called
> concurrently by other kernel components in the future.
> 
> The per-cpu initialization on each cpu can only be done once during the
> module's life time.  Use a per-cpu variable to track its status to make
> sure it is only done once in tdx_cpu_enable().
> 
> Also, a SEAMCALL to do TDX module global initialization must be done
> once on any logical cpu before any per-cpu initialization SEAMCALL.  Do
> it inside tdx_cpu_enable() too (if hasn't been done).
> 
> tdx_enable() can potentially invoke SEAMCALLs on any online cpus.  The
> per-cpu initialization must be done before those SEAMCALLs are invoked
> on some cpu.  To keep things simple, in tdx_cpu_enable(), always do the
> per-cpu initialization regardless of whether the TDX module has been
> initialized or not.  And in tdx_enable(), don't call tdx_cpu_enable()
> but assume the caller has disabled CPU hotplug, done VMXON and
> tdx_cpu_enable() on all online cpus before calling tdx_enable().
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
> 
> v11 -> v12:
>  - Simplified TDX module global init and lp init status tracking (David).
>  - Added comment around try_init_module_global() for using
>    raw_spin_lock() (Dave).
>  - Added one sentence to changelog to explain why to expose tdx_enable()
>    and tdx_cpu_enable() (Dave).
>  - Simplified comments around tdx_enable() and tdx_cpu_enable() to use
>    lockdep_assert_*() instead. (Dave)
>  - Removed redundant "TDX" in error message (Dave).
> 
> v10 -> v11:
>  - Return -ENODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off.
>  - Return the actual error code for tdx_enable() instead of -EINVAL.
>  - Added Isaku's Reviewed-by.
> 
> v9 -> v10:
>  - Merged the patch to handle per-cpu initialization to this patch to
>    tell the story better.
>  - Changed how to handle the per-cpu initialization to only provide a
>    tdx_cpu_enable() function to let the user of TDX to do it when the
>    user wants to run TDX code on a certain cpu.
>  - Changed tdx_enable() to not call cpus_read_lock() explicitly, but
>    call lockdep_assert_cpus_held() to assume the caller has done that.
>  - Improved comments around tdx_enable() and tdx_cpu_enable().
>  - Improved changelog to tell the story better accordingly.
> 
> v8 -> v9:
>  - Removed detailed TODO list in the changelog (Dave).
>  - Added back steps to do module global initialization and per-cpu
>    initialization in the TODO list comment.
>  - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h
> 
> v7 -> v8:
>  - Refined changelog (Dave).
>  - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
>  - Add a "TODO list" comment in init_tdx_module() to list all steps of
>    initializing the TDX Module to tell the story (Dave).
>  - Made tdx_enable() universally return -EINVAL, and removed nonsense
>    comments (Dave).
>  - Simplified __tdx_enable() to only handle success or failure.
>  - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR
>  - Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
>  - Improved comments (Dave).
>  - Pointed out 'tdx_module_status' is software thing (Dave).
> 
> v6 -> v7:
>  - No change.
> 
> v5 -> v6:
>  - Added code to set status to TDX_MODULE_NONE if TDX module is not
>    loaded (Chao)
>  - Added Chao's Reviewed-by.
>  - Improved comments around cpus_read_lock().
> 
> - v3->v5 (no feedback on v4):
>  - Removed the check that SEAMRR and TDX KeyID have been detected on
>    all present cpus.
>  - Removed tdx_detect().
>  - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
>    hotplug lock and return early with error message.
>  - Improved dmesg printing for TDX module detection and initialization.
> 
> 
> ---
>  arch/x86/include/asm/tdx.h  |   4 +
>  arch/x86/virt/vmx/tdx/tdx.c | 162 ++++++++++++++++++++++++++++++++++++
>  arch/x86/virt/vmx/tdx/tdx.h |  13 +++
>  3 files changed, 179 insertions(+)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 4dfe2e794411..d8226a50c58c 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -97,8 +97,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>  
>  #ifdef CONFIG_INTEL_TDX_HOST
>  bool platform_tdx_enabled(void);
> +int tdx_cpu_enable(void);
> +int tdx_enable(void);
>  #else	/* !CONFIG_INTEL_TDX_HOST */
>  static inline bool platform_tdx_enabled(void) { return false; }
> +static inline int tdx_cpu_enable(void) { return -ENODEV; }
> +static inline int tdx_enable(void)  { return -ENODEV; }
>  #endif	/* CONFIG_INTEL_TDX_HOST */
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 141d12376c4d..29ca18f66d61 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,10 @@
>  #include <linux/errno.h>
>  #include <linux/printk.h>
>  #include <linux/smp.h>
> +#include <linux/cpu.h>
> +#include <linux/spinlock.h>
> +#include <linux/percpu-defs.h>
> +#include <linux/mutex.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/archrandom.h>
> @@ -23,6 +27,13 @@ static u32 tdx_global_keyid __ro_after_init;
>  static u32 tdx_guest_keyid_start __ro_after_init;
>  static u32 tdx_nr_guest_keyids __ro_after_init;
>  
> +static bool tdx_global_initialized;
> +static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);

Why use raw_spin_lock()?

> +static DEFINE_PER_CPU(bool, tdx_lp_initialized);
> +
> +static enum tdx_module_status_t tdx_module_status;
> +static DEFINE_MUTEX(tdx_module_lock);

I think you can add a single-line comment about what states the above
variables track. But it is entirely up to you.

> +
>  /*
>   * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
>   * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> @@ -74,6 +85,157 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	}
>  }
>  
> +/*
> + * Do the module global initialization if not done yet.
> + * It's always called with interrupts and preemption disabled.
> + */
> +static int try_init_module_global(void)
> +{
> +	unsigned long flags;
> +	int ret;
> +
> +	/*
> +	 * The TDX module global initialization only needs to be done
> +	 * once on any cpu.
> +	 */
> +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> +
> +	if (tdx_global_initialized) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/* All '0's are just unused parameters. */

I have noticed that you add the above comment whenever you call seamcall() with
0 as parameters. Is this an ask from the maintainer? If not, I think you can skip
it. Just explaining the parameters in the seamcall() function definition is good
enough.

> +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> +	if (!ret)
> +		tdx_global_initialized = true;
> +out:
> +	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
> +
> +	return ret;
> +}
> +
> +/**
> + * tdx_cpu_enable - Enable TDX on local cpu
> + *
> + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
> + * global initialization SEAMCALL if not done) on local cpu to make this
> + * cpu ready to run any other SEAMCALLs.
> + *
> + * Call this function with preemption disabled.
> + *
> + * Return 0 on success, otherwise errors.
> + */
> +int tdx_cpu_enable(void)
> +{
> +	int ret;
> +
> +	if (!platform_tdx_enabled())
> +		return -ENODEV;
> +
> +	lockdep_assert_preemption_disabled();
> +
> +	/* Already done */
> +	if (__this_cpu_read(tdx_lp_initialized))
> +		return 0;
> +
> +	/*
> +	 * The TDX module global initialization is the very first step
> +	 * to enable TDX.  Need to do it first (if hasn't been done)
> +	 * before the per-cpu initialization.
> +	 */
> +	ret = try_init_module_global();
> +	if (ret)
> +		return ret;
> +
> +	/* All '0's are just unused parameters */
> +	ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
> +	if (ret)
> +		return ret;
> +
> +	__this_cpu_write(tdx_lp_initialized, true);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> +
> +static int init_tdx_module(void)
> +{
> +	/*
> +	 * TODO:
> +	 *
> +	 *  - Get TDX module information and TDX-capable memory regions.
> +	 *  - Build the list of TDX-usable memory regions.
> +	 *  - Construct a list of "TD Memory Regions" (TDMRs) to cover
> +	 *    all TDX-usable memory regions.
> +	 *  - Configure the TDMRs and the global KeyID to the TDX module.
> +	 *  - Configure the global KeyID on all packages.
> +	 *  - Initialize all TDMRs.
> +	 *
> +	 *  Return error before all steps are done.
> +	 */
> +	return -EINVAL;
> +}
> +
> +static int __tdx_enable(void)
> +{
> +	int ret;
> +
> +	ret = init_tdx_module();
> +	if (ret) {
> +		pr_err("module initialization failed (%d)\n", ret);
> +		tdx_module_status = TDX_MODULE_ERROR;
> +		return ret;
> +	}
> +
> +	pr_info("module initialized.\n");
> +	tdx_module_status = TDX_MODULE_INITIALIZED;
> +
> +	return 0;
> +}
> +
> +/**
> + * tdx_enable - Enable TDX module to make it ready to run TDX guests
> + *
> + * This function assumes the caller has: 1) held read lock of CPU hotplug
> + * lock to prevent any new cpu from becoming online; 2) done both VMXON
> + * and tdx_cpu_enable() on all online cpus.
> + *
> + * This function can be called in parallel by multiple callers.
> + *
> + * Return 0 if TDX is enabled successfully, otherwise error.
> + */
> +int tdx_enable(void)
> +{
> +	int ret;
> +
> +	if (!platform_tdx_enabled())
> +		return -ENODEV;
> +
> +	lockdep_assert_cpus_held();
> +
> +	mutex_lock(&tdx_module_lock);
> +
> +	switch (tdx_module_status) {
> +	case TDX_MODULE_UNKNOWN:
> +		ret = __tdx_enable();
> +		break;
> +	case TDX_MODULE_INITIALIZED:
> +		/* Already initialized, great, tell the caller. */
> +		ret = 0;
> +		break;
> +	default:
> +		/* Failed to initialize in the previous attempts */
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	mutex_unlock(&tdx_module_lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(tdx_enable);
> +
>  static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
>  					    u32 *nr_tdx_keyids)
>  {
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 55dbb1b8c971..9fb46033c852 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -16,11 +16,24 @@
>   */
>  #define TDX_RND_NO_ENTROPY	0x8000020300000000ULL
>  
> +/*
> + * TDX module SEAMCALL leaf functions
> + */
> +#define TDH_SYS_INIT		33
> +#define TDH_SYS_LP_INIT		35
> +
>  /*
>   * Do not put any hardware-defined TDX structure representations below
>   * this comment!
>   */
>  
> +/* Kernel defined TDX module status during module initialization. */
> +enum tdx_module_status_t {
> +	TDX_MODULE_UNKNOWN,
> +	TDX_MODULE_INITIALIZED,
> +	TDX_MODULE_ERROR
> +};
> +
>  struct tdx_module_output;
>  u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	       struct tdx_module_output *out);

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-26 14:12 ` [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure Kai Huang
@ 2023-06-27  9:48   ` kirill.shutemov
  2023-06-27 10:28     ` Huang, Kai
  2023-06-28  3:09   ` Chao Gao
  2023-06-28 12:58   ` Peter Zijlstra
  2 siblings, 1 reply; 159+ messages in thread
From: kirill.shutemov @ 2023-06-27  9:48 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo

On Tue, Jun 27, 2023 at 02:12:35AM +1200, Kai Huang wrote:
> +/*
> + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> + * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> + * leaf function return code and the additional output respectively if
> + * not NULL.
> + */
> +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> +				    u64 *seamcall_ret,
> +				    struct tdx_module_output *out)
> +{
> +	u64 sret;
> +	int cpu;
> +
> +	/* Need a stable CPU id for printing error message */
> +	cpu = get_cpu();
> +	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> +	put_cpu();
> +
> +	/* Save SEAMCALL return code if the caller wants it */
> +	if (seamcall_ret)
> +		*seamcall_ret = sret;
> +
> +	switch (sret) {
> +	case 0:
> +		/* SEAMCALL was successful */
> +		return 0;
> +	case TDX_SEAMCALL_VMFAILINVALID:
> +		pr_err_once("module is not loaded.\n");
> +		return -ENODEV;
> +	default:
> +		pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
> +				cpu, fn, sret);
> +		if (out)
> +			pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> +					out->rcx, out->rdx, out->r8,
> +					out->r9, out->r10, out->r11);

This looks excessively noisy.

Don't we have SEAMCALL leafs that can fail in normal situations? Like
the TDX_OPERAND_BUSY error code, which indicates the operation will
likely succeed on retry.

Or is that wrapper only used for never-fail SEAMCALLs? If so, please
document it.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-26 14:12 ` [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand Kai Huang
  2023-06-26 21:21   ` Sathyanarayanan Kuppuswamy
@ 2023-06-27  9:50   ` kirill.shutemov
  2023-06-27 10:34     ` Huang, Kai
  2023-06-28 13:04   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 159+ messages in thread
From: kirill.shutemov @ 2023-06-27  9:50 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo

On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> +	/*
> +	 * The TDX module global initialization only needs to be done
> +	 * once on any cpu.
> +	 */
> +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);

I don't understand how the comment justifies using raw spin lock.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-06-26 14:12 ` [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
@ 2023-06-27  9:51   ` kirill.shutemov
  2023-06-27 10:45     ` Huang, Kai
  2023-06-28 14:10   ` Peter Zijlstra
  1 sibling, 1 reply; 159+ messages in thread
From: kirill.shutemov @ 2023-06-27  9:51 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo

On Tue, Jun 27, 2023 at 02:12:38AM +1200, Kai Huang wrote:
>  static int init_tdx_module(void)
>  {
> +	struct tdsysinfo_struct *sysinfo;
> +	struct cmr_info *cmr_array;
> +	int ret;
> +
> +	/*
> +	 * Get the TDSYSINFO_STRUCT and CMRs from the TDX module.
> +	 *
> +	 * The buffers of the TDSYSINFO_STRUCT and the CMR array passed
> +	 * to the TDX module must be 1024-bytes and 512-bytes aligned
> +	 * respectively.  Allocate one page to accommodate them both and
> +	 * also meet those alignment requirements.
> +	 */
> +	sysinfo = (struct tdsysinfo_struct *)__get_free_page(GFP_KERNEL);
> +	if (!sysinfo)
> +		return -ENOMEM;
> +	cmr_array = (struct cmr_info *)((unsigned long)sysinfo + PAGE_SIZE / 2);
> +
> +	BUILD_BUG_ON(PAGE_SIZE / 2 < TDSYSINFO_STRUCT_SIZE);
> +	BUILD_BUG_ON(PAGE_SIZE / 2 < sizeof(struct cmr_info) * MAX_CMRS);

This works, but why not just use slab for this? kmalloc has 512 and 1024
pools already and you won't waste memory for rounding up.

Something like this:

        sysinfo = kmalloc(TDSYSINFO_STRUCT_SIZE, GFP_KERNEL);
        if (!sysinfo)
                return -ENOMEM;

        cmr_array_size = sizeof(struct cmr_info) * MAX_CMRS;

        /* CMR array has to be 512-aligned */
        cmr_array_size = round_up(cmr_array_size, 512);

        cmr_array = kmalloc(cmr_array_size, GFP_KERNEL);
        if (!cmr_array) {
                kfree(sysinfo);
                return -ENOMEM;
        }

?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2023-06-26 14:12 ` [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
@ 2023-06-27  9:51   ` kirill.shutemov
  2023-07-04  7:40   ` Yuan Yao
  2023-07-11 11:42   ` David Hildenbrand
  2 siblings, 0 replies; 159+ messages in thread
From: kirill.shutemov @ 2023-06-27  9:51 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo

On Tue, Jun 27, 2023 at 02:12:42AM +1200, Kai Huang wrote:
> The TDX module uses additional metadata to record things like which
> guest "owns" a given page of memory.  This metadata, referred as
> Physical Address Metadata Table (PAMT), essentially serves as the
> 'struct page' for the TDX module.  PAMTs are not reserved by hardware
> up front.  They must be allocated by the kernel and then given to the
> TDX module during module initialization.
> 
> TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
> (TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
> be a physically contiguous area from a Convertible Memory Region (CMR).
> However, the PAMTs which track pages in one TDMR do not need to reside
> within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
> any TDMR, the overlapping part must be reported as a reserved area in
> that particular TDMR.
> 
> Use alloc_contig_pages() since PAMT must be a physically contiguous area
> and it may be potentially large (~1/256th of the size of the given TDMR).
> The downside is alloc_contig_pages() may fail at runtime.  One (bad)
> mitigation is to launch a TDX guest early during system boot to get
> those PAMTs allocated at early time, but the only way to fix is to add a
> boot option to allocate or reserve PAMTs during kernel boot.
> 
> It is imperfect but will be improved on later.
> 
> TDX only supports a limited number of reserved areas per TDMR to cover
> both PAMTs and memory holes within the given TDMR.  If many PAMTs are
> allocated within a single TDMR, the reserved areas may not be sufficient
> to cover all of them.
> 
> Adopt the following policies when allocating PAMTs for a given TDMR:
> 
>   - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
>     the total number of reserved areas consumed for PAMTs.
>   - Try to first allocate PAMT from the local node of the TDMR for better
>     NUMA locality.
> 
> Also dump out how many pages are allocated for PAMTs when the TDX module
> is initialized successfully.  This helps answer the eternal "where did
> all my memory go?" questions.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-27  9:48   ` kirill.shutemov
@ 2023-06-27 10:28     ` Huang, Kai
  2023-06-27 11:36       ` kirill.shutemov
  2023-06-28  0:19       ` Isaku Yamahata
  0 siblings, 2 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-27 10:28 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, nik.borisov, hpa, peterz, Shahar, Sagi, imammedo, bp,
	Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	x86, Williams, Dan J

On Tue, 2023-06-27 at 12:48 +0300, kirill.shutemov@linux.intel.com wrote:
> On Tue, Jun 27, 2023 at 02:12:35AM +1200, Kai Huang wrote:
> > +/*
> > + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > + * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> > + * leaf function return code and the additional output respectively if
> > + * not NULL.
> > + */
> > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > +				    u64 *seamcall_ret,
> > +				    struct tdx_module_output *out)
> > +{
> > +	u64 sret;
> > +	int cpu;
> > +
> > +	/* Need a stable CPU id for printing error message */
> > +	cpu = get_cpu();
> > +	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > +	put_cpu();
> > +
> > +	/* Save SEAMCALL return code if the caller wants it */
> > +	if (seamcall_ret)
> > +		*seamcall_ret = sret;
> > +
> > +	switch (sret) {
> > +	case 0:
> > +		/* SEAMCALL was successful */
> > +		return 0;
> > +	case TDX_SEAMCALL_VMFAILINVALID:
> > +		pr_err_once("module is not loaded.\n");
> > +		return -ENODEV;
> > +	default:
> > +		pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
> > +				cpu, fn, sret);
> > +		if (out)
> > +			pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> > +					out->rcx, out->rdx, out->r8,
> > +					out->r9, out->r10, out->r11);
> 
> This looks excessively noisy.
> 
> Don't we have SEAMCALL leafs that can fail in normal situations? Like
> the TDX_OPERAND_BUSY error code, which indicates the operation will
> likely succeed on retry.

For TDX module initialization, TDX_OPERAND_BUSY cannot happen.  KVM may have
legal cases where BUSY can happen, e.g., KVM's TDP MMU supports handling faults
concurrently on different cpus, but that is still under discussion.  Also, KVM
tends to use __seamcall() directly:

https://lore.kernel.org/lkml/3c2c142e14a04a833b47f77faecaa91899b472cd.1678643052.git.isaku.yamahata@intel.com/

I guess KVM doesn't want to print a message in all cases as you said, but for
module initialization it is fine.  Those error messages are useful in case
something goes wrong, and printing them in seamcall() avoids duplicating the
printing code in all callers.

> 
> Or is that wrapper only used for never-fail SEAMCALLs? If so, please
> document it.
> 

How about adding the below?

	Use __seamcall() directly in cases that printing error message isn't
	desired, e.g., when SEAMCALL can legally fail with BUSY and the caller
	wants to retry.


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-27  9:50   ` kirill.shutemov
@ 2023-06-27 10:34     ` Huang, Kai
  2023-06-27 12:18       ` kirill.shutemov
  0 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-27 10:34 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, nik.borisov, hpa, peterz, Shahar, Sagi, imammedo, bp,
	Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	x86, Williams, Dan J

On Tue, 2023-06-27 at 12:50 +0300, kirill.shutemov@linux.intel.com wrote:
> On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > +	/*
> > +	 * The TDX module global initialization only needs to be done
> > +	 * once on any cpu.
> > +	 */
> > +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> 
> I don't understand how the comment justifies using raw spin lock.
> 

This comment is about using a lock in general.  The reason to use the raw_
version is that this function gets called in IRQ context, and on PREEMPT_RT
kernels the normal spinlock is converted to a sleeping lock.

Dave suggested commenting on the function rather than on the raw_spin_lock
directly, as no other kernel code does the latter:

https://lore.kernel.org/linux-mm/d2b3bc5e-1371-0c50-8ecb-64fc70917d42@intel.com/

So I commented the function in this version:

+/*
+ * Do the module global initialization if not done yet.
+ * It's always called with interrupts and preemption disabled.
+ */

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-26 21:21   ` Sathyanarayanan Kuppuswamy
@ 2023-06-27 10:37     ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-27 10:37 UTC (permalink / raw)
  To: sathyanarayanan.kuppuswamy, kvm, linux-kernel
  Cc: Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, kirill.shutemov, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, nik.borisov, hpa, peterz, Shahar, Sagi, imammedo, bp,
	Gao, Chao, Brown, Len, Huang, Ying, x86, Williams, Dan J

On Mon, 2023-06-26 at 14:21 -0700, Sathyanarayanan Kuppuswamy wrote:
> > +	/* All '0's are just unused parameters. */
> 
> I have noticed that you add the above comment whenever you call seamcall()
> with 0 as parameters. Is this an ask from the maintainer? If not, I think
> you can skip it. Just explaining the parameters in the seamcall() function
> definition is good enough.

Yes, I followed the maintainer's suggestion (I didn't bother to find the exact
link this time, though).  This way we don't need to go to the TDX module spec
to check whether 0 has a meaning in each SEAMCALL, especially during code
review.  I kinda agree having the comment in multiple places is a little bit
noisy, but I don't have a better way.
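
For example, a call site following this convention looks like (the leaf used
here is illustrative; see the actual patch for the exact calls):

	/* All '0's are just unused parameters. */
	ret = seamcall(TDH_SYS_KEY_CONFIG, 0, 0, 0, 0, NULL, NULL);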

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-06-27  9:51   ` kirill.shutemov
@ 2023-06-27 10:45     ` Huang, Kai
  2023-06-27 11:37       ` kirill.shutemov
  0 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-27 10:45 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, nik.borisov, hpa, peterz, Shahar, Sagi, imammedo, bp,
	Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	x86, Williams, Dan J

On Tue, 2023-06-27 at 12:51 +0300, kirill.shutemov@linux.intel.com wrote:
> On Tue, Jun 27, 2023 at 02:12:38AM +1200, Kai Huang wrote:
> >  static int init_tdx_module(void)
> >  {
> > +	struct tdsysinfo_struct *sysinfo;
> > +	struct cmr_info *cmr_array;
> > +	int ret;
> > +
> > +	/*
> > +	 * Get the TDSYSINFO_STRUCT and CMRs from the TDX module.
> > +	 *
> > +	 * The buffers of the TDSYSINFO_STRUCT and the CMR array passed
> > +	 * to the TDX module must be 1024-bytes and 512-bytes aligned
> > +	 * respectively.  Allocate one page to accommodate them both and
> > +	 * also meet those alignment requirements.
> > +	 */
> > +	sysinfo = (struct tdsysinfo_struct *)__get_free_page(GFP_KERNEL);
> > +	if (!sysinfo)
> > +		return -ENOMEM;
> > +	cmr_array = (struct cmr_info *)((unsigned long)sysinfo + PAGE_SIZE / 2);
> > +
> > +	BUILD_BUG_ON(PAGE_SIZE / 2 < TDSYSINFO_STRUCT_SIZE);
> > +	BUILD_BUG_ON(PAGE_SIZE / 2 < sizeof(struct cmr_info) * MAX_CMRS);
> 
> This works, but why not just use slab for this? kmalloc has 512 and 1024
> pools already and you won't waste memory for rounding up.
> 
> Something like this:
> 
>         sysinfo = kmalloc(TDSYSINFO_STRUCT_SIZE, GFP_KERNEL);
>         if (!sysinfo)
>                 return -ENOMEM;
> 
>         cmr_array_size = sizeof(struct cmr_info) * MAX_CMRS;
> 
>         /* CMR array has to be 512-aligned */
>         cmr_array_size = round_up(cmr_array_size, 512);

Should we define a macro for 512?

	+#define CMR_INFO_ARRAY_ALIGNMENT	512

And get rid of this comment?  AFAICT Dave didn't like such a comment mentioning
512-byte alignment when we have a macro for that.
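
With the macro, the allocation from your sketch would read (illustrative;
CMR_INFO_ARRAY_ALIGNMENT is the name proposed above):

	#define CMR_INFO_ARRAY_ALIGNMENT	512

	cmr_array_size = round_up(sizeof(struct cmr_info) * MAX_CMRS,
				  CMR_INFO_ARRAY_ALIGNMENT);
	cmr_array = kmalloc(cmr_array_size, GFP_KERNEL);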

> 
>         cmr_array = kmalloc(cmr_array_size, GFP_KERNEL);
>         if (!cmr_array) {
>                 kfree(sysinfo);
>                 return -ENOMEM;
>         }
> 
> ?
> 

I confess the reason I used __get_free_page() was to avoid allocating twice
and having to handle an additional free in the failure path.  But I can change
it if you think that's clearer.

I wouldn't worry about wasting memory.  The buffer is freed anyway for now.
In the long term it's just 4K.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-27 10:28     ` Huang, Kai
@ 2023-06-27 11:36       ` kirill.shutemov
  2023-06-28  0:19       ` Isaku Yamahata
  1 sibling, 0 replies; 159+ messages in thread
From: kirill.shutemov @ 2023-06-27 11:36 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, nik.borisov, hpa, peterz, Shahar, Sagi, imammedo, bp,
	Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	x86, Williams, Dan J

On Tue, Jun 27, 2023 at 10:28:20AM +0000, Huang, Kai wrote:
> > Or is that wrapper only used for never-fail SEAMCALLs? If so, please
> > document it.
> > 
> 
> How about adding below?
> 
> 	Use __seamcall() directly in cases that printing error message isn't
> 	desired, e.g., when SEAMCALL can legally fail with BUSY and the caller
> 	wants to retry.
> 

Looks good to me.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-06-27 10:45     ` Huang, Kai
@ 2023-06-27 11:37       ` kirill.shutemov
  2023-06-27 11:46         ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: kirill.shutemov @ 2023-06-27 11:37 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, nik.borisov, hpa, peterz, Shahar, Sagi, imammedo, bp,
	Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	x86, Williams, Dan J

On Tue, Jun 27, 2023 at 10:45:33AM +0000, Huang, Kai wrote:
> On Tue, 2023-06-27 at 12:51 +0300, kirill.shutemov@linux.intel.com wrote:
> > On Tue, Jun 27, 2023 at 02:12:38AM +1200, Kai Huang wrote:
> > >  static int init_tdx_module(void)
> > >  {
> > > +	struct tdsysinfo_struct *sysinfo;
> > > +	struct cmr_info *cmr_array;
> > > +	int ret;
> > > +
> > > +	/*
> > > +	 * Get the TDSYSINFO_STRUCT and CMRs from the TDX module.
> > > +	 *
> > > +	 * The buffers of the TDSYSINFO_STRUCT and the CMR array passed
> > > +	 * to the TDX module must be 1024-bytes and 512-bytes aligned
> > > +	 * respectively.  Allocate one page to accommodate them both and
> > > +	 * also meet those alignment requirements.
> > > +	 */
> > > +	sysinfo = (struct tdsysinfo_struct *)__get_free_page(GFP_KERNEL);
> > > +	if (!sysinfo)
> > > +		return -ENOMEM;
> > > +	cmr_array = (struct cmr_info *)((unsigned long)sysinfo + PAGE_SIZE / 2);
> > > +
> > > +	BUILD_BUG_ON(PAGE_SIZE / 2 < TDSYSINFO_STRUCT_SIZE);
> > > +	BUILD_BUG_ON(PAGE_SIZE / 2 < sizeof(struct cmr_info) * MAX_CMRS);
> > 
> > This works, but why not just use slab for this? kmalloc has 512 and 1024
> > pools already and you won't waste memory for rounding up.
> > 
> > Something like this:
> > 
> >         sysinfo = kmalloc(TDSYSINFO_STRUCT_SIZE, GFP_KERNEL);
> >         if (!sysinfo)
> >                 return -ENOMEM;
> > 
> >         cmr_array_size = sizeof(struct cmr_info) * MAX_CMRS;
> > 
> >         /* CMR array has to be 512-aligned */
> >         cmr_array_size = round_up(cmr_array_size, 512);
> 
> Should we define a macro for 512?
> 
> 	+#define CMR_INFO_ARRAY_ALIGNMENT	512
> 
> And get rid of this comment?  AFAICT Dave didn't like such a comment mentioning
> 512-byte alignment when we have a macro for that.

Good idea.

> >         cmr_array = kmalloc(cmr_array_size, GFP_KERNEL);
> >         if (!cmr_array) {
> >                 kfree(sysinfo);
> >                 return -ENOMEM;
> >         }
> > 
> > ?
> > 
> 
> I confess the reason I used __get_free_page() was to avoid allocating twice
> and having to handle an additional free in the failure path.  But I can change
> it if you think that's clearer.

Less trickery is always cleaner. Especially if the trick is not justified.

> I wouldn't worry about wasting memory.  The buffer is freed anyway for now.
> In the long term it's just 4K.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-06-27 11:37       ` kirill.shutemov
@ 2023-06-27 11:46         ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-27 11:46 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: kvm, Williams, Dan J, Raj, Ashok, Luck, Tony, david, bagasdotme,
	Hansen, Dave, ak, Wysocki, Rafael J, linux-kernel, Chatre,
	Reinette, Christopherson,,
	Sean, pbonzini, mingo, Yamahata, Isaku, nik.borisov, tglx,
	linux-mm, hpa, peterz, imammedo, Shahar, Sagi, bp, Brown, Len,
	Gao, Chao, sathyanarayanan.kuppuswamy, Huang, Ying, x86

On Tue, 2023-06-27 at 14:37 +0300, kirill.shutemov@linux.intel.com wrote:
> On Tue, Jun 27, 2023 at 10:45:33AM +0000, Huang, Kai wrote:
> > On Tue, 2023-06-27 at 12:51 +0300, kirill.shutemov@linux.intel.com wrote:
> > > On Tue, Jun 27, 2023 at 02:12:38AM +1200, Kai Huang wrote:
> > > >  static int init_tdx_module(void)
> > > >  {
> > > > +	struct tdsysinfo_struct *sysinfo;
> > > > +	struct cmr_info *cmr_array;
> > > > +	int ret;
> > > > +
> > > > +	/*
> > > > +	 * Get the TDSYSINFO_STRUCT and CMRs from the TDX module.
> > > > +	 *
> > > > +	 * The buffers of the TDSYSINFO_STRUCT and the CMR array passed
> > > > +	 * to the TDX module must be 1024-bytes and 512-bytes aligned
> > > > +	 * respectively.  Allocate one page to accommodate them both and
> > > > +	 * also meet those alignment requirements.
> > > > +	 */
> > > > +	sysinfo = (struct tdsysinfo_struct *)__get_free_page(GFP_KERNEL);
> > > > +	if (!sysinfo)
> > > > +		return -ENOMEM;
> > > > +	cmr_array = (struct cmr_info *)((unsigned long)sysinfo + PAGE_SIZE / 2);
> > > > +
> > > > +	BUILD_BUG_ON(PAGE_SIZE / 2 < TDSYSINFO_STRUCT_SIZE);
> > > > +	BUILD_BUG_ON(PAGE_SIZE / 2 < sizeof(struct cmr_info) * MAX_CMRS);
> > > 
> > > This works, but why not just use slab for this? kmalloc has 512 and 1024
> > > pools already and you won't waste memory for rounding up.
> > > 
> > > Something like this:
> > > 
> > >         sysinfo = kmalloc(TDSYSINFO_STRUCT_SIZE, GFP_KERNEL);
> > >         if (!sysinfo)
> > >                 return -ENOMEM;
> > > 
> > >         cmr_array_size = sizeof(struct cmr_info) * MAX_CMRS;
> > > 
> > >         /* CMR array has to be 512-aligned */
> > >         cmr_array_size = round_up(cmr_array_size, 512);
> > 
> > Should we define a macro for 512?
> > 
> > 	+#define CMR_INFO_ARRAY_ALIGNMENT	512
> > 
> > And get rid of this comment?  AFAICT Dave didn't like such a comment mentioning
> > 512-byte alignment when we have a macro for that.
> 
> Good idea.
> 
> > >         cmr_array = kmalloc(cmr_array_size, GFP_KERNEL);
> > >         if (!cmr_array) {
> > >                 kfree(sysinfo);
> > >                 return -ENOMEM;
> > >         }
> > > 
> > > ?
> > > 
> > 
> > I confess the reason I used __get_free_page() was to avoid allocating twice
> > and having to handle an additional free in the failure path.  But I can change
> > it if you think that's clearer.
> 
> Less trickery is always cleaner. Especially if the trick is not justified.
> 
> 

Alright.  I'll change it to allocate them separately if there are no other opinions.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-27 10:34     ` Huang, Kai
@ 2023-06-27 12:18       ` kirill.shutemov
  2023-06-27 22:37         ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: kirill.shutemov @ 2023-06-27 12:18 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, nik.borisov, hpa, peterz, Shahar, Sagi, imammedo, bp,
	Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	x86, Williams, Dan J

On Tue, Jun 27, 2023 at 10:34:04AM +0000, Huang, Kai wrote:
> On Tue, 2023-06-27 at 12:50 +0300, kirill.shutemov@linux.intel.com wrote:
> > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > > +	/*
> > > +	 * The TDX module global initialization only needs to be done
> > > +	 * once on any cpu.
> > > +	 */
> > > +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> > 
> > I don't understand how the comment justifies using raw spin lock.
> > 
> 
> This comment is about using a lock in general.  The reason to use the raw_
> version is that this function gets called in IRQ context, and on PREEMPT_RT
> kernels the normal spinlock is converted to a sleeping lock.

Sorry, but this still doesn't explain anything.

Why is converting to a sleeping lock wrong here? There are plenty of
spin_lock_irqsave() users all over the kernel that are fine to be
converted to a sleeping lock on an RT kernel. Why is this use-case special
enough to justify raw_?

From the documentation:

	raw_spinlock_t is a strict spinning lock implementation in all
	kernels, including PREEMPT_RT kernels. Use raw_spinlock_t only in
	real critical core code, low-level interrupt handling and places
	where disabling preemption or interrupts is required, for example,
	to safely access hardware state. raw_spinlock_t can sometimes also
	be used when the critical section is tiny, thus avoiding RT-mutex
	overhead.

How does it apply here?

> Dave suggested commenting on the function rather than on the raw_spin_lock
> directly, since no other kernel code does that:
> 
> https://lore.kernel.org/linux-mm/d2b3bc5e-1371-0c50-8ecb-64fc70917d42@intel.com/
> 
> So I commented the function in this version:
> 
> +/*
> + * Do the module global initialization if not done yet.
> + * It's always called with interrupts and preemption disabled.
> + */

If interrupts are always disabled why do you need _irqsave()?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-27 12:18       ` kirill.shutemov
@ 2023-06-27 22:37         ` Huang, Kai
  2023-06-28  0:28           ` Huang, Kai
  2023-06-28  0:31           ` Isaku Yamahata
  0 siblings, 2 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-27 22:37 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: kvm, Williams, Dan J, Raj, Ashok, Luck, Tony, david, bagasdotme,
	Hansen, Dave, ak, Wysocki, Rafael J, linux-kernel, Chatre,
	Reinette, Christopherson,,
	Sean, pbonzini, mingo, Yamahata, Isaku, nik.borisov, tglx,
	linux-mm, hpa, peterz, imammedo, Shahar, Sagi, bp, Brown, Len,
	Gao, Chao, sathyanarayanan.kuppuswamy, Huang, Ying, x86

> > 
> > +/*
> > + * Do the module global initialization if not done yet.
> > + * It's always called with interrupts and preemption disabled.
> > + */
> 
> If interrupts are always disabled why do you need _irqsave()?
> 

I'll remove the _irqsave().

AFAICT Isaku preferred this for additional security, but this is not necessary.



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-27 10:28     ` Huang, Kai
  2023-06-27 11:36       ` kirill.shutemov
@ 2023-06-28  0:19       ` Isaku Yamahata
  1 sibling, 0 replies; 159+ messages in thread
From: Isaku Yamahata @ 2023-06-28  0:19 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kirill.shutemov, kvm, Raj, Ashok, Hansen, Dave, david,
	bagasdotme, Luck, Tony, ak, Wysocki, Rafael J, linux-kernel,
	Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, nik.borisov, hpa, peterz, Shahar, Sagi, imammedo, bp,
	Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	x86, Williams, Dan J, isaku.yamahata

On Tue, Jun 27, 2023 at 10:28:20AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Tue, 2023-06-27 at 12:48 +0300, kirill.shutemov@linux.intel.com wrote:
> > On Tue, Jun 27, 2023 at 02:12:35AM +1200, Kai Huang wrote:
> > > +/*
> > > + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > > + * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> > > + * leaf function return code and the additional output respectively if
> > > + * not NULL.
> > > + */
> > > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > > +				    u64 *seamcall_ret,
> > > +				    struct tdx_module_output *out)
> > > +{
> > > +	u64 sret;
> > > +	int cpu;
> > > +
> > > +	/* Need a stable CPU id for printing error message */
> > > +	cpu = get_cpu();
> > > +	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > > +	put_cpu();
> > > +
> > > +	/* Save SEAMCALL return code if the caller wants it */
> > > +	if (seamcall_ret)
> > > +		*seamcall_ret = sret;
> > > +
> > > +	switch (sret) {
> > > +	case 0:
> > > +		/* SEAMCALL was successful */
> > > +		return 0;
> > > +	case TDX_SEAMCALL_VMFAILINVALID:
> > > +		pr_err_once("module is not loaded.\n");
> > > +		return -ENODEV;
> > > +	default:
> > > +		pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
> > > +				cpu, fn, sret);
> > > +		if (out)
> > > +			pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> > > +					out->rcx, out->rdx, out->r8,
> > > +					out->r9, out->r10, out->r11);
> > 
> > This look excessively noisy.
> > 
> > Don't we have SEAMCALL leafs that can fail in normal situation? Like
> > TDX_OPERAND_BUSY error code that indicate that operation likely will
> > succeed on retry.
> 
> For TDX module initialization TDX_OPERAND_BUSY cannot happen.  KVM may have
> legal cases where BUSY can happen, e.g., KVM's TDP MMU supports handling faults
> concurrently on different cpus, but that is still under discussion.  Also, KVM
> tends to use __seamcall() directly:
> 
> https://lore.kernel.org/lkml/3c2c142e14a04a833b47f77faecaa91899b472cd.1678643052.git.isaku.yamahata@intel.com/
> 
> I guess KVM doesn't want to print a message in all cases, as you said, but for
> module initialization it is fine.  Those error messages are useful in case
> something goes wrong, and printing them in seamcall() avoids duplicating the
> printing code in all callers.

That's right.  KVM wants to do its own error handling and error messaging.  Its
requirements are different from those of TDX module initialization.  I didn't
see much benefit in unifying the function.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-27 22:37         ` Huang, Kai
@ 2023-06-28  0:28           ` Huang, Kai
  2023-06-28 11:55             ` kirill.shutemov
  2023-06-28 13:35             ` Peter Zijlstra
  2023-06-28  0:31           ` Isaku Yamahata
  1 sibling, 2 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-28  0:28 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: kvm, Raj, Ashok, Huang, Ying, Hansen, Dave, david, bagasdotme,
	ak, Wysocki, Rafael J, linux-kernel, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, Yamahata, Isaku, nik.borisov, tglx, Luck,
	Tony, linux-mm, hpa, peterz, imammedo, Shahar, Sagi, bp, Brown,
	Len, Gao, Chao, sathyanarayanan.kuppuswamy, Williams, Dan J, x86

On Tue, 2023-06-27 at 22:37 +0000, Huang, Kai wrote:
> > > 
> > > +/*
> > > + * Do the module global initialization if not done yet.
> > > + * It's always called with interrupts and preemption disabled.
> > > + */
> > 
> > If interrupts are always disabled why do you need _irqsave()?
> > 
> 
> I'll remove the _irqsave().
> 
> AFAICT Isaku preferred this for additional security, but this is not
> necessary.
> 
> 

Damn.  I think we can change the comment to say this function is always called
with preemption disabled, but _can_ also be called with interrupts disabled.
And we keep using the _irqsave() version.

	/*
	 * Do the module global initialization if not done yet.  It's always
	 * called with preemption disabled and can be called with interrupts
	 * disabled.
	 */

This allows a use case where the caller simply wants to make some SEAMCALL on
the local cpu, e.g., IOMMU code may just use the below to get some TDX-IO
information:

	preempt_disable();
	vmxon();
	tdx_cpu_enable();
	SEAMCALL;
	vmxoff();
	preempt_enable();

Are you OK with this?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-27 22:37         ` Huang, Kai
  2023-06-28  0:28           ` Huang, Kai
@ 2023-06-28  0:31           ` Isaku Yamahata
  1 sibling, 0 replies; 159+ messages in thread
From: Isaku Yamahata @ 2023-06-28  0:31 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kirill.shutemov, kvm, Williams, Dan J, Raj, Ashok, Luck, Tony,
	david, bagasdotme, Hansen, Dave, ak, Wysocki, Rafael J,
	linux-kernel, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, mingo, Yamahata, Isaku, nik.borisov, tglx,
	linux-mm, hpa, peterz, imammedo, Shahar, Sagi, bp, Brown, Len,
	Gao, Chao, sathyanarayanan.kuppuswamy, Huang, Ying, x86,
	isaku.yamahata

On Tue, Jun 27, 2023 at 10:37:58PM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> > > 
> > > +/*
> > > + * Do the module global initialization if not done yet.
> > > + * It's always called with interrupts and preemption disabled.
> > > + */
> > 
> > If interrupts are always disabled why do you need _irqsave()?
> > 
> 
> I'll remove the _irqsave().
> 
> AFAICT Isaku preferred this for additional security, but this is not necessary.

It's because lockdep complains.  Anyway, it's safe to remove _irqsave as
discussed with you.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-26 14:12 ` [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure Kai Huang
  2023-06-27  9:48   ` kirill.shutemov
@ 2023-06-28  3:09   ` Chao Gao
  2023-06-28  3:34     ` Huang, Kai
  2023-06-28 12:58   ` Peter Zijlstra
  2 siblings, 1 reply; 159+ messages in thread
From: Chao Gao @ 2023-06-28  3:09 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, peterz, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

>+/*
>+ * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
>+ * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
>+ * leaf function return code and the additional output respectively if
>+ * not NULL.
>+ */
>+static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>+				    u64 *seamcall_ret,
>+				    struct tdx_module_output *out)
>+{
>+	u64 sret;
>+	int cpu;
>+
>+	/* Need a stable CPU id for printing error message */
>+	cpu = get_cpu();
>+	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
>+	put_cpu();
>+
>+	/* Save SEAMCALL return code if the caller wants it */
>+	if (seamcall_ret)
>+		*seamcall_ret = sret;

Hi Kai,

All callers in this series pass NULL for seamcall_ret. I am not sure if
you kept it intentionally.

>+
>+	switch (sret) {
>+	case 0:
>+		/* SEAMCALL was successful */

Nit: if you add

#define TDX_SUCCESS	0

and do

	case TDX_SUCCESS:
		return 0;

then the code becomes self-explanatory. i.e., you can drop the comment.
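
i.e., the switch would become (a sketch; the default branch keeps whatever
the patch already does):

	switch (sret) {
	case TDX_SUCCESS:
		return 0;
	case TDX_SEAMCALL_VMFAILINVALID:
		pr_err_once("module is not loaded.\n");
		return -ENODEV;
	default:
		/* error printing as in the patch */
		return -EIO;
	}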

>+		return 0;

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-28  3:09   ` Chao Gao
@ 2023-06-28  3:34     ` Huang, Kai
  2023-06-28 11:50       ` kirill.shutemov
  2023-06-29 11:25       ` David Hildenbrand
  0 siblings, 2 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-28  3:34 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, peterz, Shahar,
	Sagi, imammedo, bp, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Wed, 2023-06-28 at 11:09 +0800, Chao Gao wrote:
> > +/*
> > + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > + * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> > + * leaf function return code and the additional output respectively if
> > + * not NULL.
> > + */
> > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > +				    u64 *seamcall_ret,
> > +				    struct tdx_module_output *out)
> > +{
> > +	u64 sret;
> > +	int cpu;
> > +
> > +	/* Need a stable CPU id for printing error message */
> > +	cpu = get_cpu();
> > +	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > +	put_cpu();
> > +
> > +	/* Save SEAMCALL return code if the caller wants it */
> > +	if (seamcall_ret)
> > +		*seamcall_ret = sret;
> 
> Hi Kai,
> 
> All callers in this series pass NULL for seamcall_ret. I am no sure if
> you keep it intentionally.

In this series none of the callers need seamcall_ret.

> 
> > +
> > +	switch (sret) {
> > +	case 0:
> > +		/* SEAMCALL was successful */
> 
> Nit: if you add
> 
> #define TDX_SUCCESS	0
> 
> and do
> 
> 	case TDX_SUCCESS:
> 		return 0;
> 
> then the code becomes self-explanatory. i.e., you can drop the comment.

Using this, I ended up with the below:

--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -23,6 +23,8 @@
 #define TDX_SEAMCALL_GP                        (TDX_SW_ERROR | X86_TRAP_GP)
 #define TDX_SEAMCALL_UD                        (TDX_SW_ERROR | X86_TRAP_UD)
 
+#define TDX_SUCCESS           0
+

Hi Kirill/Dave/David,

Are you happy with this?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 00/22] TDX host kernel support
  2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
                   ` (21 preceding siblings ...)
  2023-06-26 14:12 ` [PATCH v12 22/22] Documentation/x86: Add documentation for TDX host support Kai Huang
@ 2023-06-28  7:04 ` Yuan Yao
  2023-06-28  8:12   ` Huang, Kai
  22 siblings, 1 reply; 159+ messages in thread
From: Yuan Yao @ 2023-06-28  7:04 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, peterz, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:30AM +1200, Kai Huang wrote:
> Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks.  TDX specs are available in [1].
>
> This series is the initial support to enable TDX with minimal code to
> allow KVM to create and run TDX guests.  KVM support for TDX is being
> developed separately[2].  A new "userspace inaccessible memfd" approach
> to support TDX private memory is also being developed[3].  The KVM will
> only support the new "userspace inaccessible memfd" as TDX guest memory.
>
> Also, a few first generations of TDX hardware have an erratum[4], and
> require additional handing.
>
> This series doesn't aim to support all functionalities, and doesn't aim
> to resolve all things perfectly.  All other optimizations will be posted
> as follow-up once this initial TDX support is upstreamed.
>
> (For memory hotplug, sorry for broadcasting widely but I cc'ed the
> linux-mm@kvack.org following Kirill's suggestion so MM experts can also
> help to provide comments.)

.....

>
> == Design Considerations ==
>
> 1. Initialize the TDX module at runtime
>
> There are basically two ways the TDX module could be initialized: either
> in early boot, or at runtime before the first TDX guest is run.  This
> series implements the runtime initialization.
>
> Also, TDX requires a per-cpu initialization SEAMCALL to be done before
> making any SEAMCALL on that cpu.
>
> This series adds two functions: tdx_cpu_enable() and tdx_enable() to do
> per-cpu initialization and module initialization respectively.
>
> 2. CPU hotplug
>
> DX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS should
  ^^

Need T here.

> never support hotpluggable CPU device and/or deliver ACPI CPU hotplug
> event to the kernel.  This series doesn't handle physical (ACPI) CPU
> hotplug at all but depends on the BIOS to behave correctly.
>
> Also, tdx_cpu_enable() will simply return error for any hot-added cpu if
> something insane happened.
>
> Note TDX works with CPU logical online/offline, thus this series still
> allows to do logical CPU online/offline.
>
> 3. Kernel policy on TDX memory
>
> The TDX module reports a list of "Convertible Memory Region" (CMR) to
> indicate which memory regions are TDX-capable.  The TDX architecture
> allows the VMM to designate specific convertible memory regions as usable
> for TDX private memory.
>
> The initial support of TDX guests will only allocate TDX private memory
> from the global page allocator.  This series chooses to designate _all_
> system RAM in the core-mm at the time of initializing TDX module as TDX
> memory to guarantee all pages in the page allocator are TDX pages.
>
> 4. Memory Hotplug
>
> After the kernel passes all "TDX-usable" memory regions to the TDX
> module, the set of "TDX-usable" memory regions are fixed during module's
> runtime.  No more "TDX-usable" memory can be added to the TDX module
> after that.
>
> To achieve the above "to guarantee all pages in the page allocator are TDX
> pages", this series simply chooses to reject any non-TDX-usable memory in
> memory hotplug.
>
> 5. Physical Memory Hotplug
>
> Note TDX assumes convertible memory is always physically present during
> machine's runtime.  A non-buggy BIOS should never support hot-removal of
> any convertible memory.  This implementation doesn't handle ACPI memory
> removal but depends on the BIOS to behave correctly.
>
> Also, if something insane really happened, 4) makes sure either TDX

Please remove "4)" if it has no specific meaning here.

> cannot be enabled or hot-added memory will be rejected after TDX gets
> enabled.
>
> 6. Kexec()
>
> Similar to AMD's SME, in kexec() kernel needs to flush dirty cachelines
> of TDX private memory otherwise they may silently corrupt the new kernel.
>
> 7. TDX erratum
>
> The first few generations of TDX hardware have an erratum.  A partial
> write to a TDX private memory cacheline will silently "poison" the
> line.  Subsequent reads will consume the poison and generate a machine
> check.
>
> The fast warm reset reboot doesn't reset TDX private memory.  With this
> erratum, all TDX private pages need to be converted back to normal
> before a fast warm reset reboot or booting to the new kernel in kexec().
> Otherwise, the new kernel may get unexpected machine check.
>
> In normal condition, triggering the erratum in Linux requires some kind
> of kernel bug involving relatively exotic memory writes to TDX private
> memory and will manifest via spurious-looking machine checks when
> reading the affected memory.  Machine check handler is improved to deal
> with such machine check.
>
>
> [1]: TDX specs
> https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
>
> [2]: KVM TDX basic feature support
> https://lore.kernel.org/kvm/cover.1685333727.git.isaku.yamahata@intel.com/T/#t
>
> [3]: KVM: mm: fd-based approach for supporting KVM
> https://lore.kernel.org/kvm/20221202061347.1070246-1-chao.p.peng@linux.intel.com/
>
> [4]: TDX erratum
> https://cdrdv2.intel.com/v1/dl/getContent/772415?explicitVersion=true
>
>
>
>
> Kai Huang (22):
>   x86/tdx: Define TDX supported page sizes as macros
>   x86/virt/tdx: Detect TDX during kernel boot
>   x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
>   x86/cpu: Detect TDX partial write machine check erratum
>   x86/virt/tdx: Add SEAMCALL infrastructure
>   x86/virt/tdx: Handle SEAMCALL running out of entropy error
>   x86/virt/tdx: Add skeleton to enable TDX on demand
>   x86/virt/tdx: Get information about TDX module and TDX-capable memory
>   x86/virt/tdx: Use all system memory when initializing TDX module as
>     TDX memory
>   x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX
>     memory regions
>   x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
>   x86/virt/tdx: Allocate and set up PAMTs for TDMRs
>   x86/virt/tdx: Designate reserved areas for all TDMRs
>   x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
>   x86/virt/tdx: Configure global KeyID on all packages
>   x86/virt/tdx: Initialize all TDMRs
>   x86/kexec: Flush cache of TDX private memory
>   x86/virt/tdx: Keep TDMRs when module initialization is successful
>   x86/kexec(): Reset TDX private memory on platforms with TDX erratum
>   x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
>   x86/mce: Improve error log of kernel space TDX #MC due to erratum
>   Documentation/x86: Add documentation for TDX host support
>
>  Documentation/arch/x86/tdx.rst     |  189 +++-
>  arch/x86/Kconfig                   |   15 +
>  arch/x86/Makefile                  |    2 +
>  arch/x86/coco/tdx/tdx.c            |    6 +-
>  arch/x86/include/asm/cpufeatures.h |    1 +
>  arch/x86/include/asm/msr-index.h   |    3 +
>  arch/x86/include/asm/tdx.h         |   26 +
>  arch/x86/kernel/cpu/intel.c        |   17 +
>  arch/x86/kernel/cpu/mce/core.c     |   33 +
>  arch/x86/kernel/machine_kexec_64.c |    9 +
>  arch/x86/kernel/process.c          |    7 +-
>  arch/x86/kernel/reboot.c           |   15 +
>  arch/x86/kernel/setup.c            |    2 +
>  arch/x86/virt/Makefile             |    2 +
>  arch/x86/virt/vmx/Makefile         |    2 +
>  arch/x86/virt/vmx/tdx/Makefile     |    2 +
>  arch/x86/virt/vmx/tdx/seamcall.S   |   52 +
>  arch/x86/virt/vmx/tdx/tdx.c        | 1542 ++++++++++++++++++++++++++++
>  arch/x86/virt/vmx/tdx/tdx.h        |  151 +++
>  arch/x86/virt/vmx/tdx/tdxcall.S    |   19 +-
>  20 files changed, 2078 insertions(+), 17 deletions(-)
>  create mode 100644 arch/x86/virt/Makefile
>  create mode 100644 arch/x86/virt/vmx/Makefile
>  create mode 100644 arch/x86/virt/vmx/tdx/Makefile
>  create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
>  create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
>  create mode 100644 arch/x86/virt/vmx/tdx/tdx.h
>
>
> base-commit: 94142c9d1bdf1c18027a42758ceb6bdd59a92012
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 00/22] TDX host kernel support
  2023-06-28  7:04 ` [PATCH v12 00/22] TDX host kernel support Yuan Yao
@ 2023-06-28  8:12   ` Huang, Kai
  2023-06-29  1:01     ` Yuan Yao
  0 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-28  8:12 UTC (permalink / raw)
  To: yuan.yao
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, peterz, Shahar,
	Sagi, imammedo, bp, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J, x86

> > 
> > 2. CPU hotplug
> > 
> > DX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS should
>   ^^
> 
> Need T here.

Thanks!

> 
[...]

> > 4. Memory Hotplug
> > 
> > After the kernel passes all "TDX-usable" memory regions to the TDX
> > module, the set of "TDX-usable" memory regions are fixed during module's
> > runtime.  No more "TDX-usable" memory can be added to the TDX module
> > after that.
> > 
> > To achieve above "to guarantee all pages in the page allocator are TDX
> > pages", this series simply choose to reject any non-TDX-usable memory in
> > memory hotplug.
> > 
> > 5. Physical Memory Hotplug
> > 
> > Note TDX assumes convertible memory is always physically present during
> > machine's runtime.  A non-buggy BIOS should never support hot-removal of
> > any convertible memory.  This implementation doesn't handle ACPI memory
> > removal but depends on the BIOS to behave correctly.
> > 
> > Also, if something insane really happened, 4) makes sure either TDX
> 
> Please remove "4)" if it has no specific meaning here.
> 

It means the mechanism mentioned in "4. Memory hotplug".

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 18/22] x86/virt/tdx: Keep TDMRs when module initialization is successful
  2023-06-26 14:12 ` [PATCH v12 18/22] x86/virt/tdx: Keep TDMRs when module initialization is successful Kai Huang
@ 2023-06-28  9:04   ` Nikolay Borisov
  2023-06-29  1:03     ` Huang, Kai
  2023-06-28 12:23   ` kirill.shutemov
  1 sibling, 1 reply; 159+ messages in thread
From: Nikolay Borisov @ 2023-06-28  9:04 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	bagasdotme, sagis, imammedo



On 26.06.23 17:12, Kai Huang wrote:
> On the platforms with the "partial write machine check" erratum, the
> kexec() needs to convert all TDX private pages back to normal before
> booting to the new kernel.  Otherwise, the new kernel may get unexpected
> machine check.
> 
> There's no existing infrastructure to track TDX private pages.  Change
> to keep TDMRs when module initialization is successful so that they can
> be used to find PAMTs.
> 
> With this change, only put_online_mems() and freeing the buffer of the
> TDSYSINFO_STRUCT and CMR array still need to be done even when module
> initialization is successful.  Adjust the error handling to explicitly
> do them when module initialization is successful and unconditionally
> clean up the rest when initialization fails.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
> 
> v11 -> v12 (new patch):
>    - Defer keeping TDMRs logic to this patch for better review
>    - Improved error handling logic (Nikolay/Kirill in patch 15)
> 
> ---
>   arch/x86/virt/vmx/tdx/tdx.c | 84 ++++++++++++++++++-------------------
>   1 file changed, 42 insertions(+), 42 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 52b7267ea226..85b24b2e9417 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -49,6 +49,8 @@ static DEFINE_MUTEX(tdx_module_lock);
>   /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
>   static LIST_HEAD(tdx_memlist);
>   
> +static struct tdmr_info_list tdx_tdmr_list;
> +
>   /*
>    * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
>    * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> @@ -1047,7 +1049,6 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
>   static int init_tdx_module(void)
>   {
>   	struct tdsysinfo_struct *sysinfo;
> -	struct tdmr_info_list tdmr_list;
>   	struct cmr_info *cmr_array;
>   	int ret;
>   
> @@ -1088,17 +1089,17 @@ static int init_tdx_module(void)
>   		goto out_put_tdxmem;
>   
>   	/* Allocate enough space for constructing TDMRs */
> -	ret = alloc_tdmr_list(&tdmr_list, sysinfo);
> +	ret = alloc_tdmr_list(&tdx_tdmr_list, sysinfo);
>   	if (ret)
>   		goto out_free_tdxmem;
>   
>   	/* Cover all TDX-usable memory regions in TDMRs */
> -	ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
> +	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, sysinfo);

nit: Does it make sense to keep passing those global variables as
function parameters? Since those functions are static, it's unlikely that
they are going to be used with any other argument, so you might as well
use the globals directly. It makes the code somewhat easier to follow.
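
i.e., something like (a sketch of the nit; the real functions take more
parameters):

	static int init_tdmrs(void)
	{
		/* Use the file-scope tdx_tdmr_list directly. */
		struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;

		/* ... loop over tdmr_list as before ... */
		return 0;
	}

	/* ... and the call site shrinks to: */
	ret = init_tdmrs();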

<snip>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum
  2023-06-26 14:12 ` [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum Kai Huang
@ 2023-06-28  9:20   ` Nikolay Borisov
  2023-06-29  0:32     ` Dave Hansen
  2023-06-29  3:19     ` Huang, Kai
  2023-06-28 12:29   ` kirill.shutemov
  2023-07-07  4:01   ` Yuan Yao
  2 siblings, 2 replies; 159+ messages in thread
From: Nikolay Borisov @ 2023-06-28  9:20 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	bagasdotme, sagis, imammedo



On 26.06.23 17:12, Kai Huang wrote:

<snip>



> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 85b24b2e9417..1107f4227568 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -51,6 +51,8 @@ static LIST_HEAD(tdx_memlist);
>   
>   static struct tdmr_info_list tdx_tdmr_list;
>   
> +static atomic_t tdx_may_has_private_mem;
> +
>   /*
>    * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
>    * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> @@ -1113,6 +1115,17 @@ static int init_tdx_module(void)
>   	 */
>   	wbinvd_on_all_cpus();
>   
> +	/*
> +	 * Starting from this point the system may have TDX private
> +	 * memory.  Make it globally visible so tdx_reset_memory() only
> +	 * reads TDMRs/PAMTs when they are stable.
> +	 *
> +	 * Note using atomic_inc_return() to provide the explicit memory
> +	 * ordering isn't mandatory here as the WBINVD above already
> +	 * does that.  Compiler barrier isn't needed here either.
> +	 */

If it's not needed, then why use it? Simply do atomic_inc() and instead
rephrase the comment to state what the ordering guarantees are and how
they are achieved (i.e. by the WBINVD above).
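
Something like (a sketch of the suggested rephrasing):

	/*
	 * Starting from this point the system may have TDX private
	 * memory.  The WBINVD above already provides the memory
	 * ordering needed here, so a plain atomic_inc() suffices.
	 */
	atomic_inc(&tdx_may_has_private_mem);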

> +	atomic_inc_return(&tdx_may_has_private_mem);
> +
>   	/* Config the key of global KeyID on all packages */
>   	ret = config_global_keyid();
>   	if (ret)
> @@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
>   	 * as suggested by the TDX spec.
>   	 */
>   	tdmrs_reset_pamt_all(&tdx_tdmr_list);
> +	/*
> +	 * No more TDX private pages now, and PAMTs/TDMRs are
> +	 * going to be freed.  Make this globally visible so
> +	 * tdx_reset_memory() can read stable TDMRs/PAMTs.
> +	 *
> +	 * Note atomic_dec_return(), which is an atomic RMW with
> +	 * return value, always enforces the memory barrier.
> +	 */
> +	atomic_dec_return(&tdx_may_has_private_mem);

Add a comment here that refers back to the comment at the increment
site.

>   out_free_pamts:
>   	tdmrs_free_pamt_all(&tdx_tdmr_list);
>   out_free_tdmrs:
> @@ -1229,6 +1251,63 @@ int tdx_enable(void)
>   }
>   EXPORT_SYMBOL_GPL(tdx_enable);
>   
> +/*
> + * Convert TDX private pages back to normal on platforms with
> + * "partial write machine check" erratum.
> + *
> + * Called from machine_kexec() before booting to the new kernel.
> + */
> +void tdx_reset_memory(void)
> +{
> +	if (!platform_tdx_enabled())
> +		return;
> +
> +	/*
> +	 * Kernel read/write to TDX private memory doesn't
> +	 * cause machine check on hardware w/o this erratum.
> +	 */
> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +		return;
> +
> +	/* Called from kexec() when only rebooting cpu is alive */
> +	WARN_ON_ONCE(num_online_cpus() != 1);
> +
> +	if (!atomic_read(&tdx_may_has_private_mem))
> +		return;

I think a comment is warranted here explicitly calling out the ordering
requirements/guarantees. Actually, this is a non-RMW operation, so it
doesn't benefit from the ordering/implicit memory barriers achieved at
the "increment" site.

<snip>


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-28  3:34     ` Huang, Kai
@ 2023-06-28 11:50       ` kirill.shutemov
  2023-06-28 23:31         ` Huang, Kai
  2023-06-29 11:25       ` David Hildenbrand
  1 sibling, 1 reply; 159+ messages in thread
From: kirill.shutemov @ 2023-06-28 11:50 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Gao, Chao, kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme,
	Luck, Tony, ak, Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Chatre, Reinette,
	Yamahata, Isaku, nik.borisov, hpa, peterz, Shahar, Sagi,
	imammedo, bp, Brown, Len, sathyanarayanan.kuppuswamy, Huang,
	Ying, Williams, Dan J, x86

On Wed, Jun 28, 2023 at 03:34:05AM +0000, Huang, Kai wrote:
> On Wed, 2023-06-28 at 11:09 +0800, Chao Gao wrote:
> > > +/*
> > > + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > > + * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> > > + * leaf function return code and the additional output respectively if
> > > + * not NULL.
> > > + */
> > > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> > > +				    u64 *seamcall_ret,
> > > +				    struct tdx_module_output *out)
> > > +{
> > > +	u64 sret;
> > > +	int cpu;
> > > +
> > > +	/* Need a stable CPU id for printing error message */
> > > +	cpu = get_cpu();
> > > +	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > > +	put_cpu();
> > > +
> > > +	/* Save SEAMCALL return code if the caller wants it */
> > > +	if (seamcall_ret)
> > > +		*seamcall_ret = sret;
> > 
> > Hi Kai,
> > 
> > All callers in this series pass NULL for seamcall_ret. I am not sure if
> > you kept it intentionally.
> 
> In this series none of the callers need seamcall_ret.

I'm fine keeping it if it is needed by KVM TDX enabling. Otherwise, just
drop it.

> > > +
> > > +	switch (sret) {
> > > +	case 0:
> > > +		/* SEAMCALL was successful */
> > 
> > Nit: if you add
> > 
> > #define TDX_SUCCESS	0
> > 
> > and do
> > 
> > 	case TDX_SUCCESS:
> > 		return 0;
> > 
> > then the code becomes self-explanatory. i.e., you can drop the comment.
> 
> If using this, I ended up with below:
> 
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -23,6 +23,8 @@
>  #define TDX_SEAMCALL_GP                        (TDX_SW_ERROR | X86_TRAP_GP)
>  #define TDX_SEAMCALL_UD                        (TDX_SW_ERROR | X86_TRAP_UD)
>  
> +#define TDX_SUCCESS           0
> +
> 
> Hi Kirill/Dave/David,
> 
> Are you happy with this?

Sure, looks good.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-28  0:28           ` Huang, Kai
@ 2023-06-28 11:55             ` kirill.shutemov
  2023-06-28 13:35             ` Peter Zijlstra
  1 sibling, 0 replies; 159+ messages in thread
From: kirill.shutemov @ 2023-06-28 11:55 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Huang, Ying, Hansen, Dave, david, bagasdotme,
	ak, Wysocki, Rafael J, linux-kernel, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, Yamahata, Isaku, nik.borisov, tglx, Luck,
	Tony, linux-mm, hpa, peterz, imammedo, Shahar, Sagi, bp, Brown,
	Len, Gao, Chao, sathyanarayanan.kuppuswamy, Williams, Dan J, x86

On Wed, Jun 28, 2023 at 12:28:12AM +0000, Huang, Kai wrote:
> On Tue, 2023-06-27 at 22:37 +0000, Huang, Kai wrote:
> > > > 
> > > > +/*
> > > > + * Do the module global initialization if not done yet.
> > > > + * It's always called with interrupts and preemption disabled.
> > > > + */
> > > 
> > > If interrupts are always disabled why do you need _irqsave()?
> > > 
> > 
> > I'll remove the _irqsave().
> > 
> > AFAICT Isaku preferred this for additional security, but this is not
> > necessary.
> > 
> > 
> 
> Damn.  I think we can change the comment to say this function is always called
> with preemption disabled, but _can_ also be called with interrupts disabled.
> And we keep using the _irqsave() version.
> 
> 	/*
> 	 * Do the module global initialization if not done yet.  It's always
> 	 * called with preemption disabled and can be called with interrupts
> 	 * disabled.
> 	 */
> 
> This allows a use case where the caller simply wants to make some SEAMCALL on
> the local cpu, e.g., IOMMU code may just use the below to get some TDX-IO
> information:
> 
> 	preempt_disable();
> 	vmxon();
> 	tdx_cpu_enable();
> 	SEAMCALL;
> 	vmxoff();
> 	preempt_enable();
> 
> Are you OK with this?

Is it a hypothetical use-case? If so, I would rather keep it simple for now
and adjust in the future if needed.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 18/22] x86/virt/tdx: Keep TDMRs when module initialization is successful
  2023-06-26 14:12 ` [PATCH v12 18/22] x86/virt/tdx: Keep TDMRs when module initialization is successful Kai Huang
  2023-06-28  9:04   ` Nikolay Borisov
@ 2023-06-28 12:23   ` kirill.shutemov
  2023-06-28 12:48     ` Nikolay Borisov
  1 sibling, 1 reply; 159+ messages in thread
From: kirill.shutemov @ 2023-06-28 12:23 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo

On Tue, Jun 27, 2023 at 02:12:48AM +1200, Kai Huang wrote:
> On the platforms with the "partial write machine check" erratum, the
> kexec() needs to convert all TDX private pages back to normal before
> booting to the new kernel.  Otherwise, the new kernel may get unexpected
> machine check.
> 
> There's no existing infrastructure to track TDX private pages.  Change
> to keep TDMRs when module initialization is successful so that they can
> be used to find PAMTs.
> 
> With this change, only put_online_mems() and freeing the buffer of the
> TDSYSINFO_STRUCT and CMR array still need to be done even when module
> initialization is successful.  Adjust the error handling to explicitly
> do them when module initialization is successful and unconditionally
> clean up the rest when initialization fails.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
> 
> v11 -> v12 (new patch):
>   - Defer keeping TDMRs logic to this patch for better review
>   - Improved error handling logic (Nikolay/Kirill in patch 15)
> 
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 84 ++++++++++++++++++-------------------
>  1 file changed, 42 insertions(+), 42 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 52b7267ea226..85b24b2e9417 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -49,6 +49,8 @@ static DEFINE_MUTEX(tdx_module_lock);
>  /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
>  static LIST_HEAD(tdx_memlist);
>  
> +static struct tdmr_info_list tdx_tdmr_list;
> +
>  /*
>   * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
>   * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> @@ -1047,7 +1049,6 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
>  static int init_tdx_module(void)
>  {
>  	struct tdsysinfo_struct *sysinfo;
> -	struct tdmr_info_list tdmr_list;
>  	struct cmr_info *cmr_array;
>  	int ret;
>  
> @@ -1088,17 +1089,17 @@ static int init_tdx_module(void)
>  		goto out_put_tdxmem;
>  
>  	/* Allocate enough space for constructing TDMRs */
> -	ret = alloc_tdmr_list(&tdmr_list, sysinfo);
> +	ret = alloc_tdmr_list(&tdx_tdmr_list, sysinfo);
>  	if (ret)
>  		goto out_free_tdxmem;
>  
>  	/* Cover all TDX-usable memory regions in TDMRs */
> -	ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
> +	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, sysinfo);
>  	if (ret)
>  		goto out_free_tdmrs;
>  
>  	/* Pass the TDMRs and the global KeyID to the TDX module */
> -	ret = config_tdx_module(&tdmr_list, tdx_global_keyid);
> +	ret = config_tdx_module(&tdx_tdmr_list, tdx_global_keyid);
>  	if (ret)
>  		goto out_free_pamts;
>  
> @@ -1118,51 +1119,50 @@ static int init_tdx_module(void)
>  		goto out_reset_pamts;
>  
>  	/* Initialize TDMRs to complete the TDX module initialization */
> -	ret = init_tdmrs(&tdmr_list);
> +	ret = init_tdmrs(&tdx_tdmr_list);
> +	if (ret)
> +		goto out_reset_pamts;
> +
> +	pr_info("%lu KBs allocated for PAMT.\n",
> +			tdmrs_count_pamt_kb(&tdx_tdmr_list));
> +
> +	/*
> +	 * @tdx_memlist is written here and read at memory hotplug time.
> +	 * Lock out memory hotplug code while building it.
> +	 */
> +	put_online_mems();
> +	/*
> +	 * For now both @sysinfo and @cmr_array are only used during
> +	 * module initialization, so always free them.
> +	 */
> +	free_page((unsigned long)sysinfo);
> +
> +	return 0;
>  out_reset_pamts:
> -	if (ret) {
> -		/*
> -		 * Part of PAMTs may already have been initialized by the
> -		 * TDX module.  Flush cache before returning PAMTs back
> -		 * to the kernel.
> -		 */
> -		wbinvd_on_all_cpus();
> -		/*
> -		 * According to the TDX hardware spec, if the platform
> -		 * doesn't have the "partial write machine check"
> -		 * erratum, any kernel read/write will never cause #MC
> -		 * in kernel space, thus it's OK to not convert PAMTs
> -		 * back to normal.  But do the conversion anyway here
> -		 * as suggested by the TDX spec.
> -		 */
> -		tdmrs_reset_pamt_all(&tdmr_list);
> -	}
> +	/*
> +	 * Part of PAMTs may already have been initialized by the
> +	 * TDX module.  Flush cache before returning PAMTs back
> +	 * to the kernel.
> +	 */
> +	wbinvd_on_all_cpus();
> +	/*
> +	 * According to the TDX hardware spec, if the platform
> +	 * doesn't have the "partial write machine check"
> +	 * erratum, any kernel read/write will never cause #MC
> +	 * in kernel space, thus it's OK to not convert PAMTs
> +	 * back to normal.  But do the conversion anyway here
> +	 * as suggested by the TDX spec.
> +	 */
> +	tdmrs_reset_pamt_all(&tdx_tdmr_list);
>  out_free_pamts:
> -	if (ret)
> -		tdmrs_free_pamt_all(&tdmr_list);
> -	else
> -		pr_info("%lu KBs allocated for PAMT.\n",
> -				tdmrs_count_pamt_kb(&tdmr_list));
> +	tdmrs_free_pamt_all(&tdx_tdmr_list);
>  out_free_tdmrs:
> -	/*
> -	 * Always free the buffer of TDMRs as they are only used during
> -	 * module initialization.
> -	 */
> -	free_tdmr_list(&tdmr_list);
> +	free_tdmr_list(&tdx_tdmr_list);
>  out_free_tdxmem:
> -	if (ret)
> -		free_tdx_memlist(&tdx_memlist);
> +	free_tdx_memlist(&tdx_memlist);
>  out_put_tdxmem:
> -	/*
> -	 * @tdx_memlist is written here and read at memory hotplug time.
> -	 * Lock out memory hotplug code while building it.
> -	 */
>  	put_online_mems();
>  out:
> -	/*
> -	 * For now both @sysinfo and @cmr_array are only used during
> -	 * module initialization, so always free them.
> -	 */
>  	free_page((unsigned long)sysinfo);
>  	return ret;
>  }

This diff is extremely hard to follow, but I think the change to error
handling Nikolay proposed has to be applied to the function from the
beginning, not changed drastically in this patch.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum
  2023-06-26 14:12 ` [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum Kai Huang
  2023-06-28  9:20   ` Nikolay Borisov
@ 2023-06-28 12:29   ` kirill.shutemov
  2023-06-29  0:27     ` Huang, Kai
  2023-07-07  4:01   ` Yuan Yao
  2 siblings, 1 reply; 159+ messages in thread
From: kirill.shutemov @ 2023-06-28 12:29 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo

On Tue, Jun 27, 2023 at 02:12:49AM +1200, Kai Huang wrote:
> @@ -1113,6 +1115,17 @@ static int init_tdx_module(void)
>  	 */
>  	wbinvd_on_all_cpus();
>  
> +	/*
> +	 * Starting from this point the system may have TDX private
> +	 * memory.  Make it globally visible so tdx_reset_memory() only
> +	 * reads TDMRs/PAMTs when they are stable.
> +	 *
> +	 * Note using atomic_inc_return() to provide the explicit memory
> +	 * ordering isn't mandatory here as the WBINVD above already
> +	 * does that.  Compiler barrier isn't needed here either.
> +	 */
> +	atomic_inc_return(&tdx_may_has_private_mem);

Why do we need atomics at all here? Writers seem to be serialized with
tdx_module_lock, and the reader accesses the variable when all CPUs but one
are down, so it cannot race.

Hm?
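
(If that holds, a plain bool with the existing serialization would do; a
minimal sketch of the scheme being described, with a hypothetical variable
name:)

	static bool tdx_may_have_private_mem;

	/* Writer: init_tdx_module(), serialized by tdx_module_lock. */
	wbinvd_on_all_cpus();
	tdx_may_have_private_mem = true;

	/* Reader: tdx_reset_memory(), at kexec time when all CPUs but
	 * the current one are offline, so it cannot race the writer. */
	if (!tdx_may_have_private_mem)
		return;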

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-26 14:12 ` [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP Kai Huang
@ 2023-06-28 12:32   ` kirill.shutemov
  2023-06-28 15:29   ` Peter Zijlstra
  1 sibling, 0 replies; 159+ messages in thread
From: kirill.shutemov @ 2023-06-28 12:32 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo

On Tue, Jun 27, 2023 at 02:12:50AM +1200, Kai Huang wrote:
> On platforms with the "partial write machine check" erratum, a kernel
> partial write to TDX private memory may cause an unexpected machine check.
> It would be nice if the #MC handler could print additional information
> to show the #MC was a TDX private memory error due to a possible kernel bug.
> 
> To do that, the machine check handler needs to use SEAMCALL to query
> page type of the error memory from the TDX module, because there's no
> existing infrastructure to track TDX private pages.
> 
> The SEAMCALL instruction causes #UD if the CPU isn't in VMX operation.
> In the #MC handler, the CPU may legitimately not be in VMX operation when
> making this SEAMCALL.  Extend the TDX_MODULE_CALL macro to handle #UD so
> the SEAMCALL can return an error code instead of an Oops in the #MC
> handler.  Opportunistically handle #GP too, since they share the same code.
> 
> As a bonus, when the kernel mistakenly calls SEAMCALL while the CPU isn't
> in VMX operation, or when TDX isn't enabled by the BIOS, or when the BIOS
> is buggy, the kernel gets a nicer error message rather than a less
> understandable Oops.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 21/22] x86/mce: Improve error log of kernel space TDX #MC due to erratum
  2023-06-26 14:12 ` [PATCH v12 21/22] x86/mce: Improve error log of kernel space TDX #MC due to erratum Kai Huang
@ 2023-06-28 12:38   ` kirill.shutemov
  2023-07-07  7:26   ` Yuan Yao
  1 sibling, 0 replies; 159+ messages in thread
From: kirill.shutemov @ 2023-06-28 12:38 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo

On Tue, Jun 27, 2023 at 02:12:51AM +1200, Kai Huang wrote:
> The first few generations of TDX hardware have an erratum.  Triggering
> it in Linux requires some kind of kernel bug involving relatively exotic
> memory writes to TDX private memory and will manifest via
> spurious-looking machine checks when reading the affected memory.
> 
> == Background ==
> 
> Virtually all kernel memory access operations happen in full cachelines.
> In practice, writing a "byte" of memory usually reads a 64-byte cacheline
> of memory, modifies it, then writes the whole line back.
> Those operations do not trigger this problem.
> 
> This problem is triggered by "partial" writes, where a write transaction
> of less than a cacheline lands at the memory controller.  The CPU does
> these via non-temporal write instructions (like MOVNTI), or through
> UC/WC memory mappings.  The issue can also be triggered away from the
> CPU by devices doing partial writes via DMA.
> 
> == Problem ==
> 
> A partial write to a TDX private memory cacheline will silently "poison"
> the line.  Subsequent reads will consume the poison and generate a
> machine check.  According to the TDX hardware spec, neither of these
> things should have happened.
> 
> To add insult to injury, the Linux machine check code will present these
> as a literal "Hardware error" when they were, in fact, a software-triggered
> issue.
> 
> == Solution ==
> 
> In the end, this issue is hard to trigger.  Rather than do something
> rash (and incomplete) like unmap TDX private memory from the direct map,
> improve the machine check handler.
> 
> Currently, the #MC handler doesn't distinguish whether the memory is
> TDX private memory or not, but just dumps, for instance, the message below:
> 
>  [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
>  [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
>  	...
>  [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>  [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
>  [...] Kernel panic - not syncing: Fatal local machine check
> 
> Which says "Hardware Error" and "Data load in unrecoverable area of
> kernel".
> 
> Ideally, it's better for the log to say "software bug around TDX private
> memory" instead of "Hardware Error".  But in reality real hardware memory
> errors can happen, and sadly such a software-triggered #MC cannot be
> distinguished from a real hardware error.  Also, the error message is
> parsed by the userspace tool 'mcelog', so changing the output may
> break userspace.
> 
> So keep the "Hardware Error".  The "Data load in unrecoverable area of
> kernel" is also helpful, so keep it too.
> 
> Instead of modifying above error log, improve the error log by printing
> additional TDX related message to make the log like:
> 
>   ...
>  [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
>  [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.
> 
> Adding this additional message requires determining whether the memory
> page is TDX private memory.  There is no existing infrastructure to do
> that.  Add an interface to query the TDX module to fill this gap.
> 
> == Impact ==
> 
> This issue requires some kind of kernel bug to trigger.
> 
> TDX private memory should never be mapped UC/WC.  A partial write
> originating from these mappings would require *two* bugs, first mapping
> the wrong page, then writing the wrong memory.  It would also be
> detectable using traditional memory corruption techniques like
> DEBUG_PAGEALLOC.
> 
> MOVNTI (and friends) could cause this issue with something like a simple
> buffer overrun or use-after-free on the direct map.  It should also be
> detectable with normal debug techniques.
> 
> The one place where this might get nasty would be if the CPU read data
> then wrote back the same data.  That would trigger this problem but
> would not, for instance, set off mechanisms like slab redzoning because
> it doesn't actually corrupt data.
> 
> With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
> TDX private memory would first need to be incorrectly mapped into the
> I/O space and then a later DMA to that mapping would actually cause the
> poisoning event.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 18/22] x86/virt/tdx: Keep TDMRs when module initialization is successful
  2023-06-28 12:23   ` kirill.shutemov
@ 2023-06-28 12:48     ` Nikolay Borisov
  2023-06-29  0:24       ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: Nikolay Borisov @ 2023-06-28 12:48 UTC (permalink / raw)
  To: kirill.shutemov, Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	bagasdotme, sagis, imammedo



On 28.06.23 at 15:23, kirill.shutemov@linux.intel.com wrote:
> On Tue, Jun 27, 2023 at 02:12:48AM +1200, Kai Huang wrote:
>> On platforms with the "partial write machine check" erratum, kexec()
>> needs to convert all TDX private pages back to normal before booting
>> to the new kernel.  Otherwise, the new kernel may get an unexpected
>> machine check.
>>
>> There's no existing infrastructure to track TDX private pages.  Change
>> to keep TDMRs when module initialization is successful so that they can
>> be used to find PAMTs.
>>
>> With this change, only put_online_mems() and freeing the buffer of the
>> TDSYSINFO_STRUCT and CMR array still need to be done even when module
>> initialization is successful.  Adjust the error handling to explicitly
>> do them when module initialization is successful and unconditionally
>> clean up the rest when initialization fails.
>>
>> Signed-off-by: Kai Huang <kai.huang@intel.com>
>> ---
>>
>> v11 -> v12 (new patch):
>>    - Defer keeping TDMRs logic to this patch for better review
>>    - Improved error handling logic (Nikolay/Kirill in patch 15)
>>
>> ---
>>   arch/x86/virt/vmx/tdx/tdx.c | 84 ++++++++++++++++++-------------------
>>   1 file changed, 42 insertions(+), 42 deletions(-)
>>
>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>> index 52b7267ea226..85b24b2e9417 100644
>> --- a/arch/x86/virt/vmx/tdx/tdx.c
>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>> @@ -49,6 +49,8 @@ static DEFINE_MUTEX(tdx_module_lock);
>>   /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
>>   static LIST_HEAD(tdx_memlist);
>>   
>> +static struct tdmr_info_list tdx_tdmr_list;
>> +
>>   /*
>>    * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
>>    * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
>> @@ -1047,7 +1049,6 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
>>   static int init_tdx_module(void)
>>   {
>>   	struct tdsysinfo_struct *sysinfo;
>> -	struct tdmr_info_list tdmr_list;
>>   	struct cmr_info *cmr_array;
>>   	int ret;
>>   
>> @@ -1088,17 +1089,17 @@ static int init_tdx_module(void)
>>   		goto out_put_tdxmem;
>>   
>>   	/* Allocate enough space for constructing TDMRs */
>> -	ret = alloc_tdmr_list(&tdmr_list, sysinfo);
>> +	ret = alloc_tdmr_list(&tdx_tdmr_list, sysinfo);
>>   	if (ret)
>>   		goto out_free_tdxmem;
>>   
>>   	/* Cover all TDX-usable memory regions in TDMRs */
>> -	ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
>> +	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, sysinfo);
>>   	if (ret)
>>   		goto out_free_tdmrs;
>>   
>>   	/* Pass the TDMRs and the global KeyID to the TDX module */
>> -	ret = config_tdx_module(&tdmr_list, tdx_global_keyid);
>> +	ret = config_tdx_module(&tdx_tdmr_list, tdx_global_keyid);
>>   	if (ret)
>>   		goto out_free_pamts;
>>   
>> @@ -1118,51 +1119,50 @@ static int init_tdx_module(void)
>>   		goto out_reset_pamts;
>>   
>>   	/* Initialize TDMRs to complete the TDX module initialization */
>> -	ret = init_tdmrs(&tdmr_list);
>> +	ret = init_tdmrs(&tdx_tdmr_list);
>> +	if (ret)
>> +		goto out_reset_pamts;
>> +
>> +	pr_info("%lu KBs allocated for PAMT.\n",
>> +			tdmrs_count_pamt_kb(&tdx_tdmr_list));
>> +
>> +	/*
>> +	 * @tdx_memlist is written here and read at memory hotplug time.
>> +	 * Lock out memory hotplug code while building it.
>> +	 */
>> +	put_online_mems();
>> +	/*
>> +	 * For now both @sysinfo and @cmr_array are only used during
>> +	 * module initialization, so always free them.
>> +	 */
>> +	free_page((unsigned long)sysinfo);
>> +
>> +	return 0;
>>   out_reset_pamts:
>> -	if (ret) {
>> -		/*
>> -		 * Part of PAMTs may already have been initialized by the
>> -		 * TDX module.  Flush cache before returning PAMTs back
>> -		 * to the kernel.
>> -		 */
>> -		wbinvd_on_all_cpus();
>> -		/*
>> -		 * According to the TDX hardware spec, if the platform
>> -		 * doesn't have the "partial write machine check"
>> -		 * erratum, any kernel read/write will never cause #MC
>> -		 * in kernel space, thus it's OK to not convert PAMTs
>> -		 * back to normal.  But do the conversion anyway here
>> -		 * as suggested by the TDX spec.
>> -		 */
>> -		tdmrs_reset_pamt_all(&tdmr_list);
>> -	}
>> +	/*
>> +	 * Part of PAMTs may already have been initialized by the
>> +	 * TDX module.  Flush cache before returning PAMTs back
>> +	 * to the kernel.
>> +	 */
>> +	wbinvd_on_all_cpus();
>> +	/*
>> +	 * According to the TDX hardware spec, if the platform
>> +	 * doesn't have the "partial write machine check"
>> +	 * erratum, any kernel read/write will never cause #MC
>> +	 * in kernel space, thus it's OK to not convert PAMTs
>> +	 * back to normal.  But do the conversion anyway here
>> +	 * as suggested by the TDX spec.
>> +	 */
>> +	tdmrs_reset_pamt_all(&tdx_tdmr_list);
>>   out_free_pamts:
>> -	if (ret)
>> -		tdmrs_free_pamt_all(&tdmr_list);
>> -	else
>> -		pr_info("%lu KBs allocated for PAMT.\n",
>> -				tdmrs_count_pamt_kb(&tdmr_list));
>> +	tdmrs_free_pamt_all(&tdx_tdmr_list);
>>   out_free_tdmrs:
>> -	/*
>> -	 * Always free the buffer of TDMRs as they are only used during
>> -	 * module initialization.
>> -	 */
>> -	free_tdmr_list(&tdmr_list);
>> +	free_tdmr_list(&tdx_tdmr_list);
>>   out_free_tdxmem:
>> -	if (ret)
>> -		free_tdx_memlist(&tdx_memlist);
>> +	free_tdx_memlist(&tdx_memlist);
>>   out_put_tdxmem:
>> -	/*
>> -	 * @tdx_memlist is written here and read at memory hotplug time.
>> -	 * Lock out memory hotplug code while building it.
>> -	 */
>>   	put_online_mems();
>>   out:
>> -	/*
>> -	 * For now both @sysinfo and @cmr_array are only used during
>> -	 * module initialization, so always free them.
>> -	 */
>>   	free_page((unsigned long)sysinfo);
>>   	return ret;
>>   }
> 
> This diff is extremely hard to follow, but I think the change to error
> handling Nikolay proposed has to be applied to the function from the
> beginning, not changed drastically in this patch.
> 


I agree. That change should be broken across the various patches 
introducing each piece of error handling.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-26 14:12 ` [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure Kai Huang
  2023-06-27  9:48   ` kirill.shutemov
  2023-06-28  3:09   ` Chao Gao
@ 2023-06-28 12:58   ` Peter Zijlstra
  2023-06-28 13:54     ` Peter Zijlstra
  2023-06-28 23:21     ` Huang, Kai
  2 siblings, 2 replies; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-28 12:58 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:35AM +1200, Kai Huang wrote:

> +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,

__always_inline perhaps? __always_unused seems wrong; worse, it's still
there at the end of the series:

$ quilt diff --combine - | grep seamcall
...
+static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
...
+       ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
+       ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
+       ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
+       ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array),
+       return seamcall(TDH_SYS_KEY_CONFIG, 0, 0, 0, 0, NULL, NULL);
+               ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
...

Definitely not unused.

> +				    u64 *seamcall_ret,
> +				    struct tdx_module_output *out)

This interface is atrocious :/ Why have these two ret values? Why can't
that live in a single space -- /me looks through the callers, and finds
seamcall_ret is unused :-(

Worse, the input (c,d,8,9) is a strict subset of the output
(c,d,8,9,10,11) so why isn't that a single thing used for both input and
output.

struct tdx_call {
	u64 rcx, rdx, r8, r9, r10, r11;
};

static __always_inline int seamcall(u64 fn, struct tdx_call *regs)
{
}


	struct tdx_call regs = { };
	ret = seamcall(TDH_SYS_INIT, &regs);



	struct tdx_call regs = {
		.rcx = sysinfo_pa,	.rdx = TDSYSINFO_STRUCT_SIZE,
		.r8  = cmr_array_pa,	.r9  = MAX_CMRS,
	};
	ret = seamcall(TDH_SYS_INFO, &regs);
	if (ret)
		return ret;

	print_cmrs(cmr_array, regs.r9);


/me looks more at this stuff and ... WTF!?!?

Can someone explain to me why __tdx_hypercall() is sane (per the above)
but then we grew __tdx_module_call() as an absolute abomination and are
apparently using that for seam too?




> +{
> +	u64 sret;
> +	int cpu;
> +
> +	/* Need a stable CPU id for printing error message */
> +	cpu = get_cpu();

And that's important because? Does having preemption off across the
seamcall make sense? Does it still make sense when you add a loop later?

> +	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> +	put_cpu();
> +
> +	/* Save SEAMCALL return code if the caller wants it */
> +	if (seamcall_ret)
> +		*seamcall_ret = sret;
> +
> +	switch (sret) {
> +	case 0:
> +		/* SEAMCALL was successful */
> +		return 0;
> +	case TDX_SEAMCALL_VMFAILINVALID:
> +		pr_err_once("module is not loaded.\n");
> +		return -ENODEV;
> +	default:
> +		pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
> +				cpu, fn, sret);
> +		if (out)
> +			pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> +					out->rcx, out->rdx, out->r8,
> +					out->r9, out->r10, out->r11);

At the very least this lacks { }, but it is quite horrendous coding
style.

Why switch() at all, would not:

	if (!sret)
		return 0;

	if (sret == TDX_SEAMCALL_VMFAILINVALID) {
		pr_nonsense();
		return -ENODEV;
	}

	if (sret == TDX_SEAMCALL_GP) {
		pr_nonsense();
		return -ENODEV;
	}

	if (sret == TDX_SEAMCALL_UD) {
		pr_nonsense();
		return -EINVAL;
	}

	pr_nonsense();
	return -EIO;

be much clearer and have less horrific indenting issues?

> +		return -EIO;
> +	}
> +}

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 06/22] x86/virt/tdx: Handle SEAMCALL running out of entropy error
  2023-06-26 14:12 ` [PATCH v12 06/22] x86/virt/tdx: Handle SEAMCALL running out of entropy error Kai Huang
@ 2023-06-28 13:02   ` Peter Zijlstra
  2023-06-28 23:30     ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-28 13:02 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:36AM +1200, Kai Huang wrote:

>  	cpu = get_cpu();
> -	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> +
> +	/*
> +	 * Certain SEAMCALL leaf functions may return error due to
> +	 * running out of entropy, in which case the SEAMCALL should
> +	 * be retried.  Handle this in SEAMCALL common function.
> +	 *
> +	 * Mimic rdrand_long() retry behavior.

Yeah, except that doesn't have preemption disabled.. you do.

> +	 */
> +	do {
> +		sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> +	} while (sret == TDX_RND_NO_ENTROPY && --retry);
> +
>  	put_cpu();
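
For reference, rdrand_long() looks roughly like this (paraphrased from
arch/x86/include/asm/archrandom.h, not verbatim):

	static inline bool __must_check rdrand_long(unsigned long *v)
	{
		bool ok;
		unsigned int retry = RDRAND_RETRY_LOOPS;

		do {
			asm volatile("rdrand %[out]"
				     CC_SET(c)
				     : CC_OUT(c) (ok), [out] "=r" (*v));
			if (ok)
				return true;
		} while (--retry);

		return false;
	}

Note it retries without disabling preemption; the "mimic" above only covers
the bounded retry count.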

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-26 14:12 ` [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand Kai Huang
  2023-06-26 21:21   ` Sathyanarayanan Kuppuswamy
  2023-06-27  9:50   ` kirill.shutemov
@ 2023-06-28 13:04   ` Peter Zijlstra
  2023-06-29  0:00     ` Huang, Kai
  2023-06-28 13:08   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-28 13:04 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:

> +static int try_init_module_global(void)
> +{
> +	unsigned long flags;
> +	int ret;
> +
> +	/*
> +	 * The TDX module global initialization only needs to be done
> +	 * once on any cpu.
> +	 */
> +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> +
> +	if (tdx_global_initialized) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/* All '0's are just unused parameters. */
> +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> +	if (!ret)
> +		tdx_global_initialized = true;
> +out:
> +	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
> +
> +	return ret;
> +}

How long does that TDH_SYS_INIT take, and why is a raw_spinlock with IRQs
disabled the right way to serialize this?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-26 14:12 ` [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand Kai Huang
                     ` (2 preceding siblings ...)
  2023-06-28 13:04   ` Peter Zijlstra
@ 2023-06-28 13:08   ` Peter Zijlstra
  2023-06-29  0:08     ` Huang, Kai
  2023-06-28 13:17   ` Peter Zijlstra
  2023-06-29 11:31   ` David Hildenbrand
  5 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-28 13:08 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:

> +/*
> + * Do the module global initialization if not done yet.
> + * It's always called with interrupts and preemption disabled.
> + */
> +static int try_init_module_global(void)
> +{
> +	unsigned long flags;
> +	int ret;
> +
> +	/*
> +	 * The TDX module global initialization only needs to be done
> +	 * once on any cpu.
> +	 */
> +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> +
> +	if (tdx_global_initialized) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/* All '0's are just unused parameters. */
> +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> +	if (!ret)
> +		tdx_global_initialized = true;
> +out:
> +	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
> +
> +	return ret;
> +}
> +
> +/**
> + * tdx_cpu_enable - Enable TDX on local cpu
> + *
> + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
> + * global initialization SEAMCALL if not done) on local cpu to make this
> + * cpu be ready to run any other SEAMCALLs.
> + *
> + * Call this function with preemption disabled.
> + *
> + * Return 0 on success, otherwise errors.
> + */
> +int tdx_cpu_enable(void)
> +{
> +	int ret;
> +
> +	if (!platform_tdx_enabled())
> +		return -ENODEV;
> +
> +	lockdep_assert_preemption_disabled();
> +
> +	/* Already done */
> +	if (__this_cpu_read(tdx_lp_initialized))
> +		return 0;
> +
> +	/*
> +	 * The TDX module global initialization is the very first step
> +	 * to enable TDX.  Need to do it first (if hasn't been done)
> +	 * before the per-cpu initialization.
> +	 */
> +	ret = try_init_module_global();
> +	if (ret)
> +		return ret;
> +
> +	/* All '0's are just unused parameters */
> +	ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
> +	if (ret)
> +		return ret;

And here you do *NOT* have IRQs disabled... so an IRQ can come in here
and do the above again.

I suspect that's a completely insane thing to have happen, but the way
the code is written does not tell me this, and might even suggest I
should worry about it, given that the function above actually disables IRQs.
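
That is, nothing in the code as written excludes this interleaving:

	tdx_cpu_enable()
	  try_init_module_global()		/* IRQs off only in here */
	  seamcall(TDH_SYS_LP_INIT, ...)
	    <IRQ>
	      tdx_cpu_enable()			/* tdx_lp_initialized still false */
	        seamcall(TDH_SYS_LP_INIT, ...)	/* LP.INIT done twice */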

> +
> +	__this_cpu_write(tdx_lp_initialized, true);
> +
> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-26 14:12 ` [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand Kai Huang
                     ` (3 preceding siblings ...)
  2023-06-28 13:08   ` Peter Zijlstra
@ 2023-06-28 13:17   ` Peter Zijlstra
  2023-06-29  0:10     ` Huang, Kai
  2023-06-29 11:31   ` David Hildenbrand
  5 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-28 13:17 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> +EXPORT_SYMBOL_GPL(tdx_cpu_enable);

I can't find a single caller of this.. why is this exported?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-28  0:28           ` Huang, Kai
  2023-06-28 11:55             ` kirill.shutemov
@ 2023-06-28 13:35             ` Peter Zijlstra
  2023-06-29  0:15               ` Huang, Kai
  1 sibling, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-28 13:35 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kirill.shutemov, kvm, Raj, Ashok, Huang, Ying, Hansen, Dave,
	david, bagasdotme, ak, Wysocki, Rafael J, linux-kernel, Chatre,
	Reinette, Christopherson,,
	Sean, pbonzini, mingo, Yamahata, Isaku, nik.borisov, tglx, Luck,
	Tony, linux-mm, hpa, imammedo, Shahar, Sagi, bp, Brown, Len, Gao,
	Chao, sathyanarayanan.kuppuswamy, Williams, Dan J, x86

On Wed, Jun 28, 2023 at 12:28:12AM +0000, Huang, Kai wrote:
> On Tue, 2023-06-27 at 22:37 +0000, Huang, Kai wrote:
> > > > 
> > > > +/*
> > > > + * Do the module global initialization if not done yet.
> > > > + * It's always called with interrupts and preemption disabled.
> > > > + */
> > > 
> > > If interrupts are always disabled why do you need _irqsave()?
> > > 
> > 
> > I'll remove the _irqsave().
> > 
> > AFAICT Isaku preferred this for additional security, but this is not
> > necessary.
> > 
> > 
> 
> Damn.  I think we can change the comment to say this function is called
> with preemption disabled, but _can_ be called with interrupts disabled.
> And we can keep using the _irqsave() version.
> 
> 	/*
> 	 * Do the module global initialization if not done yet.  It's always
> 	 * called with preemption disabled and can be called with interrupts
> 	 * disabled.
> 	 */

That's still not explaining *why*; what you want to say is:

	Can be called locally or through an IPI function call.
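
Something like:

	/*
	 * Do the module global initialization only once.  This can be
	 * called locally or through an IPI function call, hence the
	 * raw_spinlock with IRQs disabled.
	 */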

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-28 12:58   ` Peter Zijlstra
@ 2023-06-28 13:54     ` Peter Zijlstra
  2023-06-28 23:25       ` Huang, Kai
  2023-06-29 10:15       ` kirill.shutemov
  2023-06-28 23:21     ` Huang, Kai
  1 sibling, 2 replies; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-28 13:54 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Wed, Jun 28, 2023 at 02:58:13PM +0200, Peter Zijlstra wrote:

> Can someone explain to me why __tdx_hypercall() is sane (per the above)
> but then we grew __tdx_module_call() as an absolute abomination and are
> apparently using that for seam too?

That is, why do we have two different TDCALL wrappers? Makes no sense.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-06-26 14:12 ` [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
  2023-06-27  9:51   ` kirill.shutemov
@ 2023-06-28 14:10   ` Peter Zijlstra
  2023-06-29  9:15     ` Huang, Kai
  1 sibling, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-28 14:10 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:38AM +1200, Kai Huang wrote:
> +static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
> +			   struct cmr_info *cmr_array)
> +{
> +	struct tdx_module_output out;
> +	u64 sysinfo_pa, cmr_array_pa;
> +	int ret;
> +
> +	sysinfo_pa = __pa(sysinfo);
> +	cmr_array_pa = __pa(cmr_array);
> +	ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
> +			cmr_array_pa, MAX_CMRS, NULL, &out);
> +	if (ret)
> +		return ret;
> +
> +	pr_info("TDX module: attributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
> +		sysinfo->attributes,	sysinfo->vendor_id,
> +		sysinfo->major_version, sysinfo->minor_version,
> +		sysinfo->build_date,	sysinfo->build_num);
> +
> +	/* R9 contains the actual entries written to the CMR array. */

So I'm vexed by this comment; it's either not enough or too much.

I mean, as given you assume we all know about the magic parameters to
TDH_SYS_INFO but then somehow need an explanation for how %r9 is changed
from the array size to the number of used entries.

Either describe the whole thing or none of it.

Me, I would prefer all of it, because I've no idea where to begin
looking for any of this; the SDM doesn't seem to be the place. It doesn't
even list TDCALL/SEAMCALL in Volume 2 :-( Let alone describe the magic
values.

> +	print_cmrs(cmr_array, out.r9);
> +
> +	return 0;
> +}
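
E.g. something along these lines at the call site (register meanings
inferred from the code above; please double-check them against the TDX
module spec):

	/*
	 * TDH.SYS.INFO: RCX = TDSYSINFO_STRUCT buffer PA, RDX = buffer
	 * size, R8 = CMR array PA, R9 = maximum number of CMR entries.
	 * On return R9 holds the number of CMR entries actually written
	 * to the array.
	 */
	ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
			cmr_array_pa, MAX_CMRS, NULL, &out);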

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 09/22] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-06-26 14:12 ` [PATCH v12 09/22] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
@ 2023-06-28 14:17   ` Peter Zijlstra
  2023-06-29  0:57     ` Huang, Kai
  2023-07-11 11:38   ` David Hildenbrand
  1 sibling, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-28 14:17 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:39AM +1200, Kai Huang wrote:

> +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
> +			       void *v)
> +{
> +	struct memory_notify *mn = v;
> +
> +	if (action != MEM_GOING_ONLINE)
> +		return NOTIFY_OK;

So offlining TDX memory is ok?

> +
> +	/*
> +	 * Empty list means TDX isn't enabled.  Allow any memory
> +	 * to go online.
> +	 */
> +	if (list_empty(&tdx_memlist))
> +		return NOTIFY_OK;
> +
> +	/*
> +	 * The TDX memory configuration is static and can not be
> +	 * changed.  Reject onlining any memory which is outside of
> +	 * the static configuration whether it supports TDX or not.
> +	 */
> +	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
> +		NOTIFY_OK : NOTIFY_BAD;

	if (is_tdx_memory(...))
		return NOTIFY_OK;

	return NOTIFY_BAD;

> +}

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-26 14:12 ` [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP Kai Huang
  2023-06-28 12:32   ` kirill.shutemov
@ 2023-06-28 15:29   ` Peter Zijlstra
  2023-06-28 20:38     ` Peter Zijlstra
  2023-06-29 10:00     ` Huang, Kai
  1 sibling, 2 replies; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-28 15:29 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:50AM +1200, Kai Huang wrote:
> diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> index 49a54356ae99..757b0c34be10 100644
> --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> @@ -1,6 +1,7 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #include <asm/asm-offsets.h>
>  #include <asm/tdx.h>
> +#include <asm/asm.h>
>  
>  /*
>   * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
> @@ -45,6 +46,7 @@
>  	/* Leave input param 2 in RDX */
>  
>  	.if \host
> +1:
>  	seamcall

So what registers are actually clobbered by SEAMCALL ? There's a
distinct lack of it in SDM Vol.2 instruction list :-(

>  	/*
>  	 * SEAMCALL instruction is essentially a VMExit from VMX root
> @@ -57,10 +59,23 @@
>  	 * This value will never be used as actual SEAMCALL error code as
>  	 * it is from the Reserved status code class.
>  	 */
> -	jnc .Lno_vmfailinvalid
> +	jnc .Lseamcall_out
>  	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
> -.Lno_vmfailinvalid:
> +	jmp .Lseamcall_out
> +2:
> +	/*
> +	 * SEAMCALL caused #GP or #UD.  By reaching here %eax contains
> +	 * the trap number.  Convert the trap number to the TDX error
> +	 * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
> +	 *
> +	 * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
> +	 * only accepts 32-bit immediate at most.
> +	 */
> +	mov $TDX_SW_ERROR, %r12
> +	orq %r12, %rax
>  
> +	_ASM_EXTABLE_FAULT(1b, 2b)
> +.Lseamcall_out:

This is all pretty atrocious code flow... would it at all be possible to
write it like:

SYM_FUNC_START(...)

.if \host
1:	seamcall
	cmovc	%spare, %rax
2:
.else
	tdcall
.endif

	.....
	RET


3:
	mov $TDX_SW_ERROR, %r12
	orq %r12, %rax
	jmp 2b

	_ASM_EXTABLE_FAULT(1b, 3b)

SYM_FUNC_END()

That is, having all that inline in the hotpath is quite horrific.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-28 15:29   ` Peter Zijlstra
@ 2023-06-28 20:38     ` Peter Zijlstra
  2023-06-28 21:11       ` Peter Zijlstra
                         ` (2 more replies)
  2023-06-29 10:00     ` Huang, Kai
  1 sibling, 3 replies; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-28 20:38 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Wed, Jun 28, 2023 at 05:29:01PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 27, 2023 at 02:12:50AM +1200, Kai Huang wrote:
> > diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> > index 49a54356ae99..757b0c34be10 100644
> > --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> > +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> > @@ -1,6 +1,7 @@
> >  /* SPDX-License-Identifier: GPL-2.0 */
> >  #include <asm/asm-offsets.h>
> >  #include <asm/tdx.h>
> > +#include <asm/asm.h>
> >  
> >  /*
> >   * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
> > @@ -45,6 +46,7 @@
> >  	/* Leave input param 2 in RDX */
> >  
> >  	.if \host
> > +1:
> >  	seamcall
> 
> So what registers are actually clobbered by SEAMCALL ? There's a
> distinct lack of it in SDM Vol.2 instruction list :-(

With the exception of the abomination that is TDH.VP.ENTER all SEAMCALLs
seem to be limited to the set presented here (c,d,8,9,10,11) and all
other registers should be available.

Can we please make that a hard requirement: SEAMCALL must not use
registers outside this set? We can hardly program to random future
extensions; we need hard ABI guarantees here.

That also means we should be able to use si,di for the cmovc below.

Kirill, back when we did __tdx_hypercall() we got bp removed as a valid
register; the 1.0 spec still lists it, and it is also listed for
TDH.VP.ENTER. I'm assuming it will be removed there too?

bp must not be used -- it violates the pre-existing calling convention.


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-28 20:38     ` Peter Zijlstra
@ 2023-06-28 21:11       ` Peter Zijlstra
  2023-06-28 21:16         ` Peter Zijlstra
  2023-06-29 10:33       ` Huang, Kai
  2023-06-29 11:16       ` kirill.shutemov
  2 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-28 21:11 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Wed, Jun 28, 2023 at 10:38:23PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 28, 2023 at 05:29:01PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 27, 2023 at 02:12:50AM +1200, Kai Huang wrote:
> > > diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> > > index 49a54356ae99..757b0c34be10 100644
> > > --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> > > +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> > > @@ -1,6 +1,7 @@
> > >  /* SPDX-License-Identifier: GPL-2.0 */
> > >  #include <asm/asm-offsets.h>
> > >  #include <asm/tdx.h>
> > > +#include <asm/asm.h>
> > >  
> > >  /*
> > >   * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
> > > @@ -45,6 +46,7 @@
> > >  	/* Leave input param 2 in RDX */
> > >  
> > >  	.if \host
> > > +1:
> > >  	seamcall
> > 
> > So what registers are actually clobbered by SEAMCALL ? There's a
> > distinct lack of it in SDM Vol.2 instruction list :-(
> 
> With the exception of the abomination that is TDH.VP.ENTER all SEAMCALLs
> seem to be limited to the set presented here (c,d,8,9,10,11) and all
> other registers should be available.
> 
> Can we please make that a hard requirement: SEAMCALL must not use
> registers outside this set? We can hardly program to random future
> extensions; we need hard ABI guarantees here.
> 
> That also means we should be able to use si,di for the cmovc below.
> 
> Kirill, back when we did __tdx_hypercall() we got bp removed as a valid
> register; the 1.0 spec still lists it, and it is also listed for
> TDH.VP.ENTER. I'm assuming it will be removed there too?
> 
> bp must not be used -- it violates the pre-existing calling convention.

How's this then? Utterly untested. Not been near a compiler even.

--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -109,10 +109,26 @@ EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);
  * should only be used for calls that have no legitimate reason to fail
  * or where the kernel can not survive the call failing.
  */
-static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-				   struct tdx_module_output *out)
+static inline u64 _tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9)
 {
-	if (__tdx_module_call(fn, rcx, rdx, r8, r9, out))
+	struct tdx_module_args args = {
+		.rcx = rcx,
+		.rdx = rdx,
+		.r8  = r8,
+		.r9  = r9,
+	};
+	return __tdx_module_call(fn, &args);
+}
+
+static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9)
+{
+	if (_tdx_module_call(fn, rcx, rdx, r8, r9))
+		panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
+}
+
+static inline void tdx_module_call_ret(u64 fn, struct tdx_module_args *args)
+{
+	if (__tdx_module_call_ret(fn, args))
 		panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
 }
 
@@ -134,9 +150,9 @@ int tdx_mcall_get_report0(u8 *reportdata
 {
 	u64 ret;
 
-	ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
-				virt_to_phys(reportdata), TDREPORT_SUBTYPE_0,
-				0, NULL);
+	ret = _tdx_module_call(TDX_GET_REPORT,
+			       virt_to_phys(tdreport), virt_to_phys(reportdata),
+			       TDREPORT_SUBTYPE_0, 0);
 	if (ret) {
 		if (TDCALL_RETURN_CODE(ret) == TDCALL_INVALID_OPERAND)
 			return -EINVAL;
@@ -184,7 +200,7 @@ static void __noreturn tdx_panic(const c
 
 static void tdx_parse_tdinfo(u64 *cc_mask)
 {
-	struct tdx_module_output out;
+	struct tdx_module_args args = {};
 	unsigned int gpa_width;
 	u64 td_attr;
 
@@ -195,7 +211,7 @@ static void tdx_parse_tdinfo(u64 *cc_mas
 	 * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
 	 * [TDG.VP.INFO].
 	 */
-	tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
+	tdx_module_call_ret(TDX_GET_INFO, &args);
 
 	/*
 	 * The highest bit of a guest physical address is the "sharing" bit.
@@ -204,7 +220,7 @@ static void tdx_parse_tdinfo(u64 *cc_mas
 	 * The GPA width that comes out of this call is critical. TDX guests
 	 * can not meaningfully run without it.
 	 */
-	gpa_width = out.rcx & GENMASK(5, 0);
+	gpa_width = args.rcx & GENMASK(5, 0);
 	*cc_mask = BIT_ULL(gpa_width - 1);
 
 	/*
@@ -212,7 +228,7 @@ static void tdx_parse_tdinfo(u64 *cc_mas
 	 * memory.  Ensure that no #VE will be delivered for accesses to
 	 * TD-private memory.  Only VMM-shared memory (MMIO) will #VE.
 	 */
-	td_attr = out.rdx;
+	td_attr = args.rdx;
 	if (!(td_attr & ATTR_SEPT_VE_DISABLE)) {
 		const char *msg = "TD misconfiguration: SEPT_VE_DISABLE attribute must be set.";
 
@@ -620,7 +636,7 @@ __init bool tdx_early_handle_ve(struct p
 
 void tdx_get_ve_info(struct ve_info *ve)
 {
-	struct tdx_module_output out;
+	struct tdx_module_args args = {};
 
 	/*
 	 * Called during #VE handling to retrieve the #VE info from the
@@ -637,15 +653,15 @@ void tdx_get_ve_info(struct ve_info *ve)
 	 * Note, the TDX module treats virtual NMIs as inhibited if the #VE
 	 * valid flag is set. It means that NMI=>#VE will not result in a #DF.
 	 */
-	tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);
+	tdx_module_call_ret(TDX_GET_VEINFO, &args);
 
 	/* Transfer the output parameters */
-	ve->exit_reason = out.rcx;
-	ve->exit_qual   = out.rdx;
-	ve->gla         = out.r8;
-	ve->gpa         = out.r9;
-	ve->instr_len   = lower_32_bits(out.r10);
-	ve->instr_info  = upper_32_bits(out.r10);
+	ve->exit_reason = args.rcx;
+	ve->exit_qual   = args.rdx;
+	ve->gla         = args.r8;
+	ve->gpa         = args.r9;
+	ve->instr_len   = lower_32_bits(args.r10);
+	ve->instr_info  = upper_32_bits(args.r10);
 }
 
 /*
@@ -779,7 +795,7 @@ static bool try_accept_one(phys_addr_t *
 	}
 
 	tdcall_rcx = *start | page_size;
-	if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
+	if (_tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0))
 		return false;
 
 	*start += accept_size;
@@ -857,7 +873,7 @@ void __init tdx_early_init(void)
 	cc_set_mask(cc_mask);
 
 	/* Kernel does not use NOTIFY_ENABLES and does not need random #VEs */
-	tdx_module_call(TDX_WR, 0, TDCS_NOTIFY_ENABLES, 0, -1ULL, NULL);
+	tdx_module_call(TDX_WR, 0, TDCS_NOTIFY_ENABLES, 0, -1ULL);
 
 	/*
 	 * All bits above GPA width are reserved and kernel treats shared bit
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -37,7 +37,7 @@
  *
  * This is a software only structure and not part of the TDX module/VMM ABI.
  */
-struct tdx_module_output {
+struct tdx_module_args {
 	u64 rcx;
 	u64 rdx;
 	u64 r8;
@@ -67,8 +67,8 @@ struct ve_info {
 void __init tdx_early_init(void);
 
 /* Used to communicate with the TDX module */
-u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
-		      struct tdx_module_output *out);
+u64 __tdx_module_call(u64 fn, struct tdx_module_args *args);
+u64 __tdx_module_call_ret(u64 fn, struct tdx_module_args *args);
 
 void tdx_get_ve_info(struct ve_info *ve);
 
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
@@ -17,37 +17,44 @@
  *            TDX module and hypercalls to the VMM.
  * SEAMCALL - used by TDX hosts to make requests to the
  *            TDX module.
+ *
+ *-------------------------------------------------------------------------
+ * TDCALL/SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX                 - TDCALL Leaf number.
+ * RCX,RDX,R8-R9       - TDCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX                 - TDCALL instruction error code.
+ * RCX,RDX,R8-R11      - TDCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_module_call() function ABI:
+ *
+ * @fn   (RDI)         - TDCALL Leaf ID,    moved to RAX
+ * @regs (RSI)         - struct tdx_regs pointer
+ *
+ * Return status of TDCALL via RAX.
  */
-.macro TDX_MODULE_CALL host:req
-	/*
-	 * R12 will be used as temporary storage for struct tdx_module_output
-	 * pointer. Since R12-R15 registers are not used by TDCALL/SEAMCALL
-	 * services supported by this function, it can be reused.
-	 */
+.macro TDX_MODULE_CALL host:req ret:req
+	FRAME_BEGIN
 
-	/* Callee saved, so preserve it */
-	push %r12
+	mov	%rdi, %rax
+	mov	$TDX_SEAMCALL_VMFAILINVALID, %rdi
 
-	/*
-	 * Push output pointer to stack.
-	 * After the operation, it will be fetched into R12 register.
-	 */
-	push %r9
+	mov	TDX_MODULE_rcx(%rsi), %rcx
+	mov	TDX_MODULE_rdx(%rsi), %rdx
+	mov	TDX_MODULE_r8(%rsi),  %r8
+	mov	TDX_MODULE_r9(%rsi),  %r9
+//	mov	TDX_MODULE_r10(%rsi), %r10
+//	mov	TDX_MODULE_r11(%rsi), %r11
 
-	/* Mangle function call ABI into TDCALL/SEAMCALL ABI: */
-	/* Move Leaf ID to RAX */
-	mov %rdi, %rax
-	/* Move input 4 to R9 */
-	mov %r8,  %r9
-	/* Move input 3 to R8 */
-	mov %rcx, %r8
-	/* Move input 1 to RCX */
-	mov %rsi, %rcx
-	/* Leave input param 2 in RDX */
-
-	.if \host
-1:
-	seamcall
+.if \host
+1:	seamcall
 	/*
 	 * SEAMCALL instruction is essentially a VMExit from VMX root
 	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
@@ -59,53 +66,30 @@
 	 * This value will never be used as actual SEAMCALL error code as
 	 * it is from the Reserved status code class.
 	 */
-	jnc .Lseamcall_out
-	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
-	jmp .Lseamcall_out
+	cmovc	%rdi, %rax
 2:
-	/*
-	 * SEAMCALL caused #GP or #UD.  By reaching here %eax contains
-	 * the trap number.  Convert the trap number to the TDX error
-	 * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
-	 *
-	 * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
-	 * only accepts 32-bit immediate at most.
-	 */
-	mov $TDX_SW_ERROR, %r12
-	orq %r12, %rax
-
-	_ASM_EXTABLE_FAULT(1b, 2b)
-.Lseamcall_out:
-	.else
+.else
 	tdcall
-	.endif
-
-	/*
-	 * Fetch output pointer from stack to R12 (It is used
-	 * as temporary storage)
-	 */
-	pop %r12
+.endif
 
-	/*
-	 * Since this macro can be invoked with NULL as an output pointer,
-	 * check if caller provided an output struct before storing output
-	 * registers.
-	 *
-	 * Update output registers, even if the call failed (RAX != 0).
-	 * Other registers may contain details of the failure.
-	 */
-	test %r12, %r12
-	jz .Lno_output_struct
+.if \ret
+	movq %rcx, TDX_MODULE_rcx(%rsi)
+	movq %rdx, TDX_MODULE_rdx(%rsi)
+	movq %r8,  TDX_MODULE_r8(%rsi)
+	movq %r9,  TDX_MODULE_r9(%rsi)
+	movq %r10, TDX_MODULE_r10(%rsi)
+	movq %r11, TDX_MODULE_r11(%rsi)
+.endif
+
+	FRAME_END
+	RET
+
+.if \host
+3:
+	mov	$TDX_SW_ERROR, %rdi
+	or	%rdi, %rax
+	jmp 2b
 
-	/* Copy result registers to output struct: */
-	movq %rcx, TDX_MODULE_rcx(%r12)
-	movq %rdx, TDX_MODULE_rdx(%r12)
-	movq %r8,  TDX_MODULE_r8(%r12)
-	movq %r9,  TDX_MODULE_r9(%r12)
-	movq %r10, TDX_MODULE_r10(%r12)
-	movq %r11, TDX_MODULE_r11(%r12)
-
-.Lno_output_struct:
-	/* Restore the state of R12 register */
-	pop %r12
+	_ASM_EXTABLE_FAULT(1b, 3b)
+.endif
 .endm

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-28 21:11       ` Peter Zijlstra
@ 2023-06-28 21:16         ` Peter Zijlstra
  2023-06-30  9:03           ` kirill.shutemov
  0 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-28 21:16 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Wed, Jun 28, 2023 at 11:11:32PM +0200, Peter Zijlstra wrote:
> --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> @@ -17,37 +17,44 @@
>   *            TDX module and hypercalls to the VMM.
>   * SEAMCALL - used by TDX hosts to make requests to the
>   *            TDX module.
> + *
> + *-------------------------------------------------------------------------
> + * TDCALL/SEAMCALL ABI:
> + *-------------------------------------------------------------------------
> + * Input Registers:
> + *
> + * RAX                 - TDCALL Leaf number.
> + * RCX,RDX,R8-R9       - TDCALL Leaf specific input registers.
> + *
> + * Output Registers:
> + *
> + * RAX                 - TDCALL instruction error code.
> + * RCX,RDX,R8-R11      - TDCALL Leaf specific output registers.
> + *
> + *-------------------------------------------------------------------------
> + *
> + * __tdx_module_call() function ABI:
> + *
> + * @fn   (RDI)         - TDCALL Leaf ID,    moved to RAX
> + * @regs (RSI)         - struct tdx_regs pointer
> + *
> + * Return status of TDCALL via RAX.
>   */
> +.macro TDX_MODULE_CALL host:req ret:req
> +	FRAME_BEGIN
>  
> +	mov	%rdi, %rax
> +	mov	$TDX_SEAMCALL_VMFAILINVALID, %rdi
>  
> +	mov	TDX_MODULE_rcx(%rsi), %rcx
> +	mov	TDX_MODULE_rdx(%rsi), %rdx
> +	mov	TDX_MODULE_r8(%rsi),  %r8
> +	mov	TDX_MODULE_r9(%rsi),  %r9
> +//	mov	TDX_MODULE_r10(%rsi), %r10
> +//	mov	TDX_MODULE_r11(%rsi), %r11
>  
> +.if \host
> +1:	seamcall
>  	/*
>  	 * SEAMCALL instruction is essentially a VMExit from VMX root
>  	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
	...
>  	 * This value will never be used as actual SEAMCALL error code as
>  	 * it is from the Reserved status code class.
>  	 */
> +	cmovc	%rdi, %rax
>  2:
> +.else
>  	tdcall
> +.endif
>  
> +.if \ret
> +	movq %rcx, TDX_MODULE_rcx(%rsi)
> +	movq %rdx, TDX_MODULE_rdx(%rsi)
> +	movq %r8,  TDX_MODULE_r8(%rsi)
> +	movq %r9,  TDX_MODULE_r9(%rsi)
> +	movq %r10, TDX_MODULE_r10(%rsi)
> +	movq %r11, TDX_MODULE_r11(%rsi)
> +.endif
> +
> +	FRAME_END
> +	RET
> +
> +.if \host
> +3:
> +	mov	$TDX_SW_ERROR, %rdi
> +	or	%rdi, %rax
> +	jmp 2b
>  
> +	_ASM_EXTABLE_FAULT(1b, 3b)
> +.endif
>  .endm

Isn't that much simpler?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-28 12:58   ` Peter Zijlstra
  2023-06-28 13:54     ` Peter Zijlstra
@ 2023-06-28 23:21     ` Huang, Kai
  2023-06-29  3:40       ` Huang, Kai
  1 sibling, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-28 23:21 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Wed, 2023-06-28 at 14:58 +0200, Peter Zijlstra wrote:
> On Tue, Jun 27, 2023 at 02:12:35AM +1200, Kai Huang wrote:
> 
> > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> 
> __always_inline perhaps? __always_unused seems wrong, worse it's still
> there at the end of the series:
> 
> $ quilt diff --combine - | grep seamcall
> ...
> +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> ...
> +       ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> +       ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
> +       ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
> +       ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array),
> +       return seamcall(TDH_SYS_KEY_CONFIG, 0, 0, 0, 0, NULL, NULL);
> +               ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
> ...
> 
> Definitely not unused.

Thanks for reviewing!

Sorry, obviously I forgot to remove __always_unused in the patch that first
uses seamcall().  Should have been more careful. :(

> 
> > +				    u64 *seamcall_ret,
> > +				    struct tdx_module_output *out)
> 
> This interface is atrocious :/ Why have these two ret values? Why can't
> > that live in a single space -- /me looks through the callers, and finds
> seamcall_ret is unused :-(

I'll remove @seamcall_ret, as also suggested by Kirill.

> 
> Worse, the input (c,d,8,9) is a strict subset of the output
> (c,d,8,9,10,11) so why isn't that a single thing used for both input and
> output.
> 
> struct tdx_call {
> 	u64 rcx, rdx, r8, r9, r10, r11;
> };
> 
> static __always_inline int seamcall(u64 fn, struct tdx_call *regs)
> {
> }
> 
> 
> 	struct tdx_call regs = { };
> 	ret = seamcall(TDH_SYS_INIT, &regs);
> 
> 
> 
> 	struct tdx_call regs = {
> 		.rcx = sysinfo_pa,	.rdx = TDSYSINFO_STRUCT_SIZE,
> 		.r8  = cmr_array_pa,	.r9  = MAX_CMRS,
> 	};
> 	ret = seamcall(TDH_SYS_INFO, &regs);
> 	if (ret)
> 		return ret;
> 
> 	print_cmrs(cmr_array, regs.r9);
> 
> 
> /me looks more at this stuff and ... WTF!?!?
> 
> Can someone explain to me why __tdx_hypercall() is sane (per the above)
> but then we grew __tdx_module_call() as an absolute abomination and are
> apparently using that for seam too?
> 
> 

Sorry, I don't know the story behind __tdx_hypercall().

For TDCALL and SEAMCALL, I believe one reason is that they can be used in
performance-critical paths.  The @out is not always used, so putting all
outputs into a structure reduces the number of function parameters.  I once
had separate struct tdx_seamcall_input {} and struct tdx_seamcall_out {},
but that wasn't preferred.

Kirill, could you help to explain?

> 
> 
> > +{
> > +	u64 sret;
> > +	int cpu;
> > +
> > +	/* Need a stable CPU id for printing error message */
> > +	cpu = get_cpu();
> 
> And that's important because? 
> 

I want to have a stable CPU id for error message printing.

> Does having preemption off across the
> seamcall make sense? Does it still make sense when you add a loop later?

SEAMCALL itself isn't interruptible, so I think having preemption off around
SEAMCALL is fine.  But I agree disabling preemption around multiple SEAMCALLs
isn't ideal.  I'll change that to only disable preemption around a single
SEAMCALL, to get a correct CPU id for error printing.
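
I.e. something like (untested sketch of what I mean):

	do {
		preempt_disable();
		cpu = smp_processor_id();
		sret = __seamcall(fn, rcx, rdx, r8, r9, out);
		preempt_enable();
	} while (sret == TDX_RND_NO_ENTROPY && --retry);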

> 
> > +	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > +	put_cpu();
> > +
> > +	/* Save SEAMCALL return code if the caller wants it */
> > +	if (seamcall_ret)
> > +		*seamcall_ret = sret;
> > +
> > +	switch (sret) {
> > +	case 0:
> > +		/* SEAMCALL was successful */
> > +		return 0;
> > +	case TDX_SEAMCALL_VMFAILINVALID:
> > +		pr_err_once("module is not loaded.\n");
> > +		return -ENODEV;
> > +	default:
> > +		pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
> > +				cpu, fn, sret);
> > +		if (out)
> > +			pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n",
> > +					out->rcx, out->rdx, out->r8,
> > +					out->r9, out->r10, out->r11);
> 
> At the very least this lacks { }, but it is quite horrendous coding
> style.
> 
> Why switch() at all, would not:
> 
> 	if (!sret)
> 		return 0;
> 
> 	if (sret == TDX_SEAMCALL_VMFAILINVALID) {
> 		pr_nonsense();
> 		return -ENODEV;
> 	}
> 
> 	if (sret == TDX_SEAMCALL_GP) {
> 		pr_nonsense();
> 		return -ENODEV;
> 	}
> 
> 	if (sret == TDX_SEAMCALL_UD) {
> 		pr_nonsense();
> 		return -EINVAL;
> 	}
> 
> 	pr_nonsense();
> 	return -EIO;
> 
> be much clearer and have less horrific indenting issues?

I can certainly change to this style.  Thanks.
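
Concretely, the tail of seamcall() would then look roughly like the below (a
sketch of the suggested style with the pr_nonsense() placeholders filled in;
the #GP/#UD cases only exist once the later exception-handling patch is
applied, so they are omitted here):

	if (!sret)
		return 0;

	if (sret == TDX_SEAMCALL_VMFAILINVALID) {
		pr_err_once("module is not loaded.\n");
		return -ENODEV;
	}

	pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n",
			cpu, fn, sret);
	return -EIO;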


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-28 13:54     ` Peter Zijlstra
@ 2023-06-28 23:25       ` Huang, Kai
  2023-06-29 10:15       ` kirill.shutemov
  1 sibling, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-28 23:25 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Wed, 2023-06-28 at 15:54 +0200, Peter Zijlstra wrote:
> On Wed, Jun 28, 2023 at 02:58:13PM +0200, Peter Zijlstra wrote:
> 
> > Can someone explain to me why __tdx_hypercall() is sane (per the above)
> > but then we grew __tdx_module_call() as an absolute abomination and are
> > apparently using that for seam too?
> 
> That is, why do we have two different TDCALL wrappers? Makes no sense.
> 
I think the reason is that TDCALL/SEAMCALL can be used in performance-critical
paths, but TDVMCALL isn't.

For example, SEAMCALLs are used in KVM's MMU code to handle page faults for
TDX private pages.

Kirill, could you help to clarify?  Thanks.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 06/22] x86/virt/tdx: Handle SEAMCALL running out of entropy error
  2023-06-28 13:02   ` Peter Zijlstra
@ 2023-06-28 23:30     ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-28 23:30 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Wed, 2023-06-28 at 15:02 +0200, Peter Zijlstra wrote:
> On Tue, Jun 27, 2023 at 02:12:36AM +1200, Kai Huang wrote:
> 
> >  	cpu = get_cpu();
> > -	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > +
> > +	/*
> > +	 * Certain SEAMCALL leaf functions may return error due to
> > +	 * running out of entropy, in which case the SEAMCALL should
> > +	 * be retried.  Handle this in SEAMCALL common function.
> > +	 *
> > +	 * Mimic rdrand_long() retry behavior.
> 
> Yeah, except that doesn't have preemption disabled.. you do.
> 

Agreed.  I'll change it to only disable preemption around one SEAMCALL (to get
the CPU id for error printing).

But with this change it makes more sense to split this wrapper function out as
a separate patch and put it after the skeleton patch, since this way we require
the caller to guarantee all online cpus have been in VMX operation (SEAMCALL
requires the CPU to be in VMX operation), which is the assumption that
tdx_enable() has anyway.
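
Roughly, the wrapper would then look like the below (a sketch only; assuming
the retry count mimics rdrand_long() via RDRAND_RETRY_LOOPS):

	u64 sret;
	int cpu, retry = RDRAND_RETRY_LOOPS;

	do {
		/*
		 * Disable preemption around each individual SEAMCALL to get
		 * a stable CPU id for error printing, rather than holding
		 * preemption off across the whole retry loop.
		 */
		cpu = get_cpu();
		sret = __seamcall(fn, rcx, rdx, r8, r9, out);
		put_cpu();
	} while (sret == TDX_RND_NO_ENTROPY && --retry);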

Hi Kirill/Dave/David,

Please let me know if you have any comments.

> > +	 */
> > +	do {
> > +		sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > +	} while (sret == TDX_RND_NO_ENTROPY && --retry);
> > +
> >  	put_cpu();


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-28 11:50       ` kirill.shutemov
@ 2023-06-28 23:31         ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-28 23:31 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: kvm, x86, Raj, Ashok, Hansen, Dave, david, bagasdotme, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, tglx, linux-mm, Yamahata,
	Isaku, mingo, Luck, Tony, hpa, peterz, nik.borisov, imammedo,
	Gao, Chao, bp, Brown, Len, Shahar, Sagi,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J

On Wed, 2023-06-28 at 14:50 +0300, kirill.shutemov@linux.intel.com wrote:
> On Wed, Jun 28, 2023 at 03:34:05AM +0000, Huang, Kai wrote:
> > On Wed, 2023-06-28 at 11:09 +0800, Chao Gao wrote:
> > > > +/*
> > > > + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> > > > + * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> > > > + * leaf function return code and the additional output respectively if
> > > > + * not NULL.
> > > > + */
> > > > +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64
> > > > r9,
> > > > +				    u64 *seamcall_ret,
> > > > +				    struct tdx_module_output *out)
> > > > +{
> > > > +	u64 sret;
> > > > +	int cpu;
> > > > +
> > > > +	/* Need a stable CPU id for printing error message */
> > > > +	cpu = get_cpu();
> > > > +	sret = __seamcall(fn, rcx, rdx, r8, r9, out);
> > > > +	put_cpu();
> > > > +
> > > > +	/* Save SEAMCALL return code if the caller wants it */
> > > > +	if (seamcall_ret)
> > > > +		*seamcall_ret = sret;
> > > 
> > > Hi Kai,
> > > 
> > > All callers in this series pass NULL for seamcall_ret. I am no sure if
> > > you keep it intentionally.
> > 
> > In this series none of the callers needs seamcall_ret.
> 
> I'm fine keeping it if it is needed by KVM TDX enabling. Otherwise, just
> drop it.

No problem, I'll drop it.  KVM uses __seamcall() anyway.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-28 13:04   ` Peter Zijlstra
@ 2023-06-29  0:00     ` Huang, Kai
  2023-06-30  9:25       ` Peter Zijlstra
  0 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  0:00 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Wed, 2023-06-28 at 15:04 +0200, Peter Zijlstra wrote:
> On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> 
> > +static int try_init_module_global(void)
> > +{
> > +	unsigned long flags;
> > +	int ret;
> > +
> > +	/*
> > +	 * The TDX module global initialization only needs to be done
> > +	 * once on any cpu.
> > +	 */
> > +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> > +
> > +	if (tdx_global_initialized) {
> > +		ret = 0;
> > +		goto out;
> > +	}
> > +
> > +	/* All '0's are just unused parameters. */
> > +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > +	if (!ret)
> > +		tdx_global_initialized = true;
> > +out:
> > +	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
> > +
> > +	return ret;
> > +}
> 
> How long does that TDX_SYS_INIT take and why is a raw_spinlock with IRQs
> disabled the right way to serialize this?

The spec says it doesn't have a latency requirement, so theoretically it could
take long.  SEAMCALL is a VMEXIT so it would cost at least thousands of cycles.

If a raw_spinlock isn't desired, I think I can introduce another function to do
this and let the caller call it before calling tdx_cpu_enable().  E.g., we
can have the below functions:

1) tdx_global_init()	-> TDH_SYS_INIT
2) tdx_cpu_init()	-> TDH_SYS_LP_INIT
3) tdx_enable()		-> actual module initialization

How does this sound?
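
A minimal sketch of 1), assuming the caller serializes tdx_global_init()
itself so that no raw_spinlock with IRQs disabled is needed around
TDH.SYS.INIT:

	/* Caller is responsible for serializing this function. */
	int tdx_global_init(void)
	{
		int ret;

		if (tdx_global_initialized)
			return 0;

		/* All '0's are just unused parameters. */
		ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
		if (!ret)
			tdx_global_initialized = true;

		return ret;
	}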

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-28 13:08   ` Peter Zijlstra
@ 2023-06-29  0:08     ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  0:08 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Wed, 2023-06-28 at 15:08 +0200, Peter Zijlstra wrote:
> On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> 
> > +/*
> > + * Do the module global initialization if not done yet.
> > + * It's always called with interrupts and preemption disabled.
> > + */
> > +static int try_init_module_global(void)
> > +{
> > +	unsigned long flags;
> > +	int ret;
> > +
> > +	/*
> > +	 * The TDX module global initialization only needs to be done
> > +	 * once on any cpu.
> > +	 */
> > +	raw_spin_lock_irqsave(&tdx_global_init_lock, flags);
> > +
> > +	if (tdx_global_initialized) {
> > +		ret = 0;
> > +		goto out;
> > +	}
> > +
> > +	/* All '0's are just unused parameters. */
> > +	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
> > +	if (!ret)
> > +		tdx_global_initialized = true;
> > +out:
> > +	raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags);
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * tdx_cpu_enable - Enable TDX on local cpu
> > + *
> > + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module
> > + * global initialization SEAMCALL if not done) on local cpu to make this
> > + * cpu be ready to run any other SEAMCALLs.
> > + *
> > + * Call this function with preemption disabled.
> > + *
> > + * Return 0 on success, otherwise errors.
> > + */
> > +int tdx_cpu_enable(void)
> > +{
> > +	int ret;
> > +
> > +	if (!platform_tdx_enabled())
> > +		return -ENODEV;
> > +
> > +	lockdep_assert_preemption_disabled();
> > +
> > +	/* Already done */
> > +	if (__this_cpu_read(tdx_lp_initialized))
> > +		return 0;
> > +
> > +	/*
> > +	 * The TDX module global initialization is the very first step
> > +	 * to enable TDX.  Need to do it first (if hasn't been done)
> > +	 * before the per-cpu initialization.
> > +	 */
> > +	ret = try_init_module_global();
> > +	if (ret)
> > +		return ret;
> > +
> > +	/* All '0's are just unused parameters */
> > +	ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL);
> > +	if (ret)
> > +		return ret;
> 
> And here you do *NOT* have IRQs disabled... so an IRQ can come in here
> and do the above again.
> 
> I suspect that's a completely insane thing to have happen, but the way
> the code is written does not tell me this and might even suggest I
> should worry about it, per the above thing actually disabling IRQs.
> 

I can change lockdep_assert_preemption_disabled() to
lockdep_assert_irqs_disabled(), making this function only callable from an IPI
function call.  As Kirill also suggested, we can do it this way since for now
KVM is the only user of this function and it enables hardware for all cpus via
IPI.
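
A sketch of what that would look like (illustrative only; the rest of the
function stays unchanged):

	int tdx_cpu_enable(void)
	{
		if (!platform_tdx_enabled())
			return -ENODEV;

		/* Only called via IPI function call, where IRQs are off. */
		lockdep_assert_irqs_disabled();

		/* ... rest unchanged ... */
	}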

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-28 13:17   ` Peter Zijlstra
@ 2023-06-29  0:10     ` Huang, Kai
  2023-06-30  9:26       ` Peter Zijlstra
  0 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  0:10 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote:
> On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> 
> I can't find a single caller of this.. why is this exported?

It's for the KVM TDX patches to use, which aren't in this series.

I'll remove the export.  The KVM TDX series can add it back.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-28 13:35             ` Peter Zijlstra
@ 2023-06-29  0:15               ` Huang, Kai
  2023-06-30  9:22                 ` Peter Zijlstra
  0 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  0:15 UTC (permalink / raw)
  To: peterz
  Cc: kvm, x86, Raj, Ashok, Hansen, Dave, david, bagasdotme, ak,
	Wysocki, Rafael J, linux-kernel, Chatre, Reinette, mingo,
	pbonzini, Christopherson,,
	Sean, Yamahata, Isaku, nik.borisov, tglx, Luck, Tony,
	kirill.shutemov, hpa, imammedo, sathyanarayanan.kuppuswamy,
	linux-mm, bp, Brown, Len, Shahar, Sagi, Huang, Ying, Williams,
	Dan J, Gao, Chao

On Wed, 2023-06-28 at 15:35 +0200, Peter Zijlstra wrote:
> On Wed, Jun 28, 2023 at 12:28:12AM +0000, Huang, Kai wrote:
> > On Tue, 2023-06-27 at 22:37 +0000, Huang, Kai wrote:
> > > > > 
> > > > > +/*
> > > > > + * Do the module global initialization if not done yet.
> > > > > + * It's always called with interrupts and preemption disabled.
> > > > > + */
> > > > 
> > > > If interrupts are always disabled why do you need _irqsave()?
> > > > 
> > > 
> > > I'll remove the _irqsave().
> > > 
> > > AFAICT Isaku preferred this for additional security, but this is not
> > > necessary.
> > > 
> > > 
> > 
> > Damn.  I think we can change the comment to say this function is called with
> > preemption being disabled, but _can_ be called with interrupt disabled.  And we
> > keep using the _irqsave() version.
> > 
> > 	/*
> > 	 * Do the module global initialization if not done yet.  It's always
> > 	 * called with preemption disabled and can be called with interrupts
> > 	 * disabled.
> > 	 */
> 
> That's still not explaining *why*, what you want to say is:
> 
> 	Can be called locally or through an IPI function call.
> 

Thanks.  As in another reply, if using a spinlock is OK, then I think we can
say it will be called either locally or through an IPI function call.
Otherwise, we can do it via a new separate function tdx_global_init(), in
which case no lock is needed in that function and the caller should call it
properly.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 18/22] x86/virt/tdx: Keep TDMRs when module initialization is successful
  2023-06-28 12:48     ` Nikolay Borisov
@ 2023-06-29  0:24       ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  0:24 UTC (permalink / raw)
  To: kirill.shutemov, nik.borisov
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, hpa, peterz, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, x86,
	Williams, Dan J

On Wed, 2023-06-28 at 15:48 +0300, Nikolay Borisov wrote:
> > This diff is extremely hard to follow, but I think the change to error
> > handling Nikolay proposed has to be applied to the function from the
> > beginning, not changed drastically in this patch.
> > 
> 
> 
> I agree. That change should be broken across the various patches 
> introducing each piece of error handling.

No, I don't want to do this.  The TDMRs only need to be saved if we want to do
the next patch (reset TDX memory).  They are always freed in the previous
patch.  We could add justification for keeping them in the previous patch, but
I now want to avoid such a pattern because I believe it's not the right way to
organize patches:

Such justification obviously depends on the later patch.  In case the later
patch has something wrong and needs to be updated, the justification can
become invalid, and we would need to adjust the previous patches accordingly.
This could result in code review frustration.

Specifically for this issue, if we always free TDMRs in the previous patches,
then it's just not right to do what you suggested there.  Also, now with
dynamic allocation of the TDSYSINFO_STRUCT and the CMR array, we need to do 3
things when module initialization is successful:
	put_online_mems();
	kfree(sysinfo);
	kfree(cmr_array);
	return 0;
out_xxx:
	....
	put_online_mems();
	kfree(sysinfo);
	kfree(cmr_array);
	return ret;

I can hardly say which is better.  I am willing to do the above pattern if you
guys prefer it, but I certainly don't want to mix this logic into previous
patches.
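
For comparison, a unified exit path would look roughly like this (a sketch
only):

	ret = 0;	/* success falls through to the common cleanup */
out_put_tdxmem:
	/*
	 * Needed on both success and failure: the memory hotplug lock and
	 * the sysinfo/cmr_array buffers are no longer used once module
	 * initialization has finished either way.
	 */
	put_online_mems();
	kfree(sysinfo);
	kfree(cmr_array);
	return ret;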

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum
  2023-06-28 12:29   ` kirill.shutemov
@ 2023-06-29  0:27     ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  0:27 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, nik.borisov, hpa, peterz, Shahar, Sagi, imammedo, bp,
	Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	x86, Williams, Dan J

On Wed, 2023-06-28 at 15:29 +0300, kirill.shutemov@linux.intel.com wrote:
> On Tue, Jun 27, 2023 at 02:12:49AM +1200, Kai Huang wrote:
> > @@ -1113,6 +1115,17 @@ static int init_tdx_module(void)
> >  	 */
> >  	wbinvd_on_all_cpus();
> >  
> > +	/*
> > +	 * Starting from this point the system may have TDX private
> > +	 * memory.  Make it globally visible so tdx_reset_memory() only
> > +	 * reads TDMRs/PAMTs when they are stable.
> > +	 *
> > +	 * Note using atomic_inc_return() to provide the explicit memory
> > +	 * ordering isn't mandatory here as the WBINVD above already
> > +	 * does that.  Compiler barrier isn't needed here either.
> > +	 */
> > +	atomic_inc_return(&tdx_may_has_private_mem);
> 
> Why do we need atomics at all here? Writers seems serialized with
> tdx_module_lock and reader accesses the variable when all CPUs, but one is
> down and cannot race.
> 

In kexec() the reader reads this when all remote cpus are dead, w/o holding
the module lock.  All remote cpus can be stopped at _ANY_ time, meaning they
can be stopped anywhere in the middle of init_tdx_module().  IIUC the module
lock doesn't help here.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum
  2023-06-28  9:20   ` Nikolay Borisov
@ 2023-06-29  0:32     ` Dave Hansen
  2023-06-29  0:58       ` Huang, Kai
  2023-06-29  3:19     ` Huang, Kai
  1 sibling, 1 reply; 159+ messages in thread
From: Dave Hansen @ 2023-06-29  0:32 UTC (permalink / raw)
  To: Nikolay Borisov, Kai Huang, linux-kernel, kvm
  Cc: linux-mm, x86, kirill.shutemov, tony.luck, peterz, tglx, bp,
	mingo, hpa, seanjc, pbonzini, david, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	bagasdotme, sagis, imammedo

On 6/28/23 02:20, Nikolay Borisov wrote:
>>
>>   +    /*
>> +     * Starting from this point the system may have TDX private
>> +     * memory.  Make it globally visible so tdx_reset_memory() only
>> +     * reads TDMRs/PAMTs when they are stable.
>> +     *
>> +     * Note using atomic_inc_return() to provide the explicit memory
>> +     * ordering isn't mandatory here as the WBINVD above already
>> +     * does that.  Compiler barrier isn't needed here either.
>> +     */
> 
> If it's not needed, then why use it? Simply do atomic_inc() and instead
> rephrase the comment to state what are the ordering guarantees and how
> they are achieved (i.e by using wbinvd above).

Even better, explain why the barrier needs to be there and *IGNORE* the
WBINVD.

If the WBINVD gets moved -- or if the gods ever bless us with a halfway
reasonable way to flush the caches that's not full serializing -- this
code is screwed.

There is _zero_ reason to try and "optimize" this junk by trying to get
rid of a memory barrier at the risk of screwing it over later.

I use "optimize" in quotes because that's a highly charitable way of
describing this activity.



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 09/22] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-06-28 14:17   ` Peter Zijlstra
@ 2023-06-29  0:57     ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  0:57 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Wed, 2023-06-28 at 16:17 +0200, Peter Zijlstra wrote:
> On Tue, Jun 27, 2023 at 02:12:39AM +1200, Kai Huang wrote:
> 
> > +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
> > +			       void *v)
> > +{
> > +	struct memory_notify *mn = v;
> > +
> > +	if (action != MEM_GOING_ONLINE)
> > +		return NOTIFY_OK;
> 
> So offlining TDX memory is ok?

Yes.  We want to support the normal software memory hotplug logic even when
TDX is enabled.  The user can offline part of the memory and then online it
again.

> 
> > +
> > +	/*
> > +	 * Empty list means TDX isn't enabled.  Allow any memory
> > +	 * to go online.
> > +	 */
> > +	if (list_empty(&tdx_memlist))
> > +		return NOTIFY_OK;
> > +
> > +	/*
> > +	 * The TDX memory configuration is static and can not be
> > +	 * changed.  Reject onlining any memory which is outside of
> > +	 * the static configuration whether it supports TDX or not.
> > +	 */
> > +	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
> > +		NOTIFY_OK : NOTIFY_BAD;
> 
> 	if (is_tdx_memory(...))
> 		return NOTIFY_OK;
> 
> 	return NOTIFY_BAD;
> 

Sure, will do.  Thanks!

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum
  2023-06-29  0:32     ` Dave Hansen
@ 2023-06-29  0:58       ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  0:58 UTC (permalink / raw)
  To: kvm, Hansen, Dave, nik.borisov, linux-kernel
  Cc: Raj, Ashok, Luck, Tony, david, bagasdotme, ak, Wysocki, Rafael J,
	kirill.shutemov, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, linux-mm, tglx, Yamahata, Isaku, mingo, hpa,
	peterz, Shahar, Sagi, imammedo, bp, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, x86, Williams, Dan J

On Wed, 2023-06-28 at 17:32 -0700, Dave Hansen wrote:
> On 6/28/23 02:20, Nikolay Borisov wrote:
> > > 
> > >   +    /*
> > > +     * Starting from this point the system may have TDX private
> > > +     * memory.  Make it globally visible so tdx_reset_memory() only
> > > +     * reads TDMRs/PAMTs when they are stable.
> > > +     *
> > > +     * Note using atomic_inc_return() to provide the explicit memory
> > > +     * ordering isn't mandatory here as the WBINVD above already
> > > +     * does that.  Compiler barrier isn't needed here either.
> > > +     */
> > 
> > If it's not needed, then why use it? Simply do atomic_inc() and instead
> > rephrase the comment to state what are the ordering guarantees and how
> > they are achieved (i.e by using wbinvd above).
> 
> Even better, explain why the barrier needs to be there and *IGNORE* the
> WBVIND.
> 
> If the WBINVD gets moved -- or if the gods ever bless us with a halfway
> reasonable way to flush the caches that's not full serializing -- this
> code is screwed.
> 
> There is _zero_ reason to try and "optimize" this junk by trying to get
> rid of a memory barrier at the risk of screwing it over later.
> 
> I use "optimize" in quotes because that's a highly charitable way of
> describing this activity.
> 

Agreed.  I'll try to explain this well and come back.

Thanks!

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 00/22] TDX host kernel support
  2023-06-28  8:12   ` Huang, Kai
@ 2023-06-29  1:01     ` Yuan Yao
  0 siblings, 0 replies; 159+ messages in thread
From: Yuan Yao @ 2023-06-29  1:01 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, peterz, Shahar,
	Sagi, imammedo, bp, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J, x86

On Wed, Jun 28, 2023 at 08:12:55AM +0000, Huang, Kai wrote:
> > >
> > > 2. CPU hotplug
> > >
> > > DX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS should
> >   ^^
> >
> > Need T here.
>
> Thanks!
>
> >
> [...]
>
> > > 4. Memory Hotplug
> > >
> > > After the kernel passes all "TDX-usable" memory regions to the TDX
> > > module, the set of "TDX-usable" memory regions are fixed during module's
> > > runtime.  No more "TDX-usable" memory can be added to the TDX module
> > > after that.
> > >
> > > To achieve above "to guarantee all pages in the page allocator are TDX
> > > pages", this series simply choose to reject any non-TDX-usable memory in
> > > memory hotplug.
> > >
> > > 5. Physical Memory Hotplug
> > >
> > > Note TDX assumes convertible memory is always physically present during
> > > machine's runtime.  A non-buggy BIOS should never support hot-removal of
> > > any convertible memory.  This implementation doesn't handle ACPI memory
> > > removal but depends on the BIOS to behave correctly.
> > >
> > > Also, if something insane really happened, 4) makes sure either TDX
> >
> > Please remove "4)" if have no specific meaning here.
> >
>
> It means the mechanism mentioned in "4. Memory hotplug".

Ah, I see, it's fine by me, thanks.

> >

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 18/22] x86/virt/tdx: Keep TDMRs when module initialization is successful
  2023-06-28  9:04   ` Nikolay Borisov
@ 2023-06-29  1:03     ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  1:03 UTC (permalink / raw)
  To: kvm, nik.borisov, linux-kernel
  Cc: Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, kirill.shutemov, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, hpa, peterz, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, x86,
	Williams, Dan J

On Wed, 2023-06-28 at 12:04 +0300, Nikolay Borisov wrote:
> 
> On 26.06.23 г. 17:12 ч., Kai Huang wrote:
> > On the platforms with the "partial write machine check" erratum, the
> > kexec() needs to convert all TDX private pages back to normal before
> > booting to the new kernel.  Otherwise, the new kernel may get unexpected
> > machine check.
> > 
> > There's no existing infrastructure to track TDX private pages.  Change
> > to keep TDMRs when module initialization is successful so that they can
> > be used to find PAMTs.
> > 
> > With this change, only put_online_mems() and freeing the buffer of the
> > TDSYSINFO_STRUCT and CMR array still need to be done even when module
> > initialization is successful.  Adjust the error handling to explicitly
> > do them when module initialization is successful and unconditionally
> > clean up the rest when initialization fails.
> > 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> > 
> > v11 -> v12 (new patch):
> >    - Defer keeping TDMRs logic to this patch for better review
> >    - Improved error handling logic (Nikolay/Kirill in patch 15)
> > 
> > ---
> >   arch/x86/virt/vmx/tdx/tdx.c | 84 ++++++++++++++++++-------------------
> >   1 file changed, 42 insertions(+), 42 deletions(-)
> > 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 52b7267ea226..85b24b2e9417 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -49,6 +49,8 @@ static DEFINE_MUTEX(tdx_module_lock);
> >   /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
> >   static LIST_HEAD(tdx_memlist);
> >   
> > +static struct tdmr_info_list tdx_tdmr_list;
> > +
> >   /*
> >    * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> >    * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> > @@ -1047,7 +1049,6 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list)
> >   static int init_tdx_module(void)
> >   {
> >   	struct tdsysinfo_struct *sysinfo;
> > -	struct tdmr_info_list tdmr_list;
> >   	struct cmr_info *cmr_array;
> >   	int ret;
> >   
> > @@ -1088,17 +1089,17 @@ static int init_tdx_module(void)
> >   		goto out_put_tdxmem;
> >   
> >   	/* Allocate enough space for constructing TDMRs */
> > -	ret = alloc_tdmr_list(&tdmr_list, sysinfo);
> > +	ret = alloc_tdmr_list(&tdx_tdmr_list, sysinfo);
> >   	if (ret)
> >   		goto out_free_tdxmem;
> >   
> >   	/* Cover all TDX-usable memory regions in TDMRs */
> > -	ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo);
> > +	ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, sysinfo);
> 
> nit: Does it make sense to keep passing those global variables are 
> function parameters? Since those functions are static it's unlikely that 
> they are going to be used with any other parameter so might as well use 
> the parameter directly. It makes the code somewhat easier to follow.
> 

I disagree.  To me, passing 'struct tdmr_info_list *tdmr_list' to
construct_tdmrs() as a parameter makes this function clearer:

It takes all TDX memory blocks and the sysinfo, generates the TDMRs, and
stores them in the buffer specified by the tdmr_list.  The internal logic
doesn't need to care whether any of those parameters are static or not.
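
For reference, the shape of the function under discussion (signature
reconstructed from the call sites above; the first parameter name is
illustrative):

	/*
	 * Take the TDX memory blocks and the sysinfo, construct the TDMRs,
	 * and store them in the buffer specified by @tdmr_list.
	 */
	static int construct_tdmrs(struct list_head *tmb_list,
				   struct tdmr_info_list *tdmr_list,
				   struct tdsysinfo_struct *sysinfo);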

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum
  2023-06-28  9:20   ` Nikolay Borisov
  2023-06-29  0:32     ` Dave Hansen
@ 2023-06-29  3:19     ` Huang, Kai
  2023-06-29  5:38       ` Huang, Kai
  1 sibling, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  3:19 UTC (permalink / raw)
  To: kvm, nik.borisov, linux-kernel
  Cc: Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, kirill.shutemov, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, hpa, peterz, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, x86,
	Williams, Dan J

On Wed, 2023-06-28 at 12:20 +0300, Nikolay Borisov wrote:
> > +	atomic_inc_return(&tdx_may_has_private_mem);
> > +
> >    	/* Config the key of global KeyID on all packages */
> >    	ret = config_global_keyid();
> >    	if (ret)
> > @@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
> >    	 * as suggested by the TDX spec.
> >    	 */
> >    	tdmrs_reset_pamt_all(&tdx_tdmr_list);
> > +	/*
> > +	 * No more TDX private pages now, and PAMTs/TDMRs are
> > +	 * going to be freed.  Make this globally visible so
> > +	 * tdx_reset_memory() can read stable TDMRs/PAMTs.
> > +	 *
> > +	 * Note atomic_dec_return(), which is an atomic RMW with
> > +	 * return value, always enforces the memory barrier.
> > +	 */
> > +	atomic_dec_return(&tdx_may_has_private_mem);
> 
> Make a comment here which either refers to the comment at the increment 
> site.

I guess I got your point.  Will try to make better comments.

> 
> >    out_free_pamts:
> >    	tdmrs_free_pamt_all(&tdx_tdmr_list);
> >    out_free_tdmrs:
> > @@ -1229,6 +1251,63 @@ int tdx_enable(void)
> >    }
> >    EXPORT_SYMBOL_GPL(tdx_enable);
> >    
> > +/*
> > + * Convert TDX private pages back to normal on platforms with
> > + * "partial write machine check" erratum.
> > + *
> > + * Called from machine_kexec() before booting to the new kernel.
> > + */
> > +void tdx_reset_memory(void)
> > +{
> > +	if (!platform_tdx_enabled())
> > +		return;
> > +
> > +	/*
> > +	 * Kernel read/write to TDX private memory doesn't
> > +	 * cause machine check on hardware w/o this erratum.
> > +	 */
> > +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> > +		return;
> > +
> > +	/* Called from kexec() when only rebooting cpu is alive */
> > +	WARN_ON_ONCE(num_online_cpus() != 1);
> > +
> > +	if (!atomic_read(&tdx_may_has_private_mem))
> > +		return;
> 
> I think a comment is warranted here explicitly calling our the ordering 
> requirement/guarantees. Actually this is a non-rmw operation so it 
> doesn't have any bearing on the ordering/implicit mb's achieved at the 
> "increment" site.

We don't need an explicit ordering/barrier here, if I am not missing
something.  The atomic_{inc/dec}_return() already ensures the memory ordering
-- which guarantees that when @tdx_may_has_private_mem reads true _here_, the
TDMRs/PAMTs must be stable.

Quoted from Documentation/atomic_t.txt:

"
 - RMW operations that have a return value are fully ordered;   

 ...

Fully ordered primitives are ordered against everything prior and everything   
subsequent. Therefore a fully ordered primitive is like having an smp_mb()     
before and an smp_mb() after the primitive.
"


Am I missing anything? 

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-28 23:21     ` Huang, Kai
@ 2023-06-29  3:40       ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  3:40 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen, Dave, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, tglx, linux-kernel, linux-mm, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Wed, 2023-06-28 at 23:21 +0000, Huang, Kai wrote:
> > > +	/* Need a stable CPU id for printing error message */
> > > +	cpu = get_cpu();
> > 
> > And that's important because? 
> > 
> 
> I want to have a stable cpu for error message printing.

Sorry, I misunderstood your question.

I think having the CPU id on which the SEAMCALL failed in the dmesg would be
better, but it's not absolutely needed.  I can remove it (and thus remove
{get|put}_cpu()) if you prefer not to print it.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum
  2023-06-29  3:19     ` Huang, Kai
@ 2023-06-29  5:38       ` Huang, Kai
  2023-06-29  9:45         ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  5:38 UTC (permalink / raw)
  To: kvm, nik.borisov, linux-kernel
  Cc: Williams, Dan J, Raj, Ashok, Hansen, Dave, david, bagasdotme, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, Yamahata, Isaku, tglx, Luck, Tony, hpa,
	peterz, Shahar, Sagi, imammedo, linux-mm, bp, Brown, Len, Gao,
	Chao, sathyanarayanan.kuppuswamy, Huang, Ying, x86

On Thu, 2023-06-29 at 03:19 +0000, Huang, Kai wrote:
> On Wed, 2023-06-28 at 12:20 +0300, Nikolay Borisov wrote:
> > > +	atomic_inc_return(&tdx_may_has_private_mem);
> > > +
> > >    	/* Config the key of global KeyID on all packages */
> > >    	ret = config_global_keyid();
> > >    	if (ret)
> > > @@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
> > >    	 * as suggested by the TDX spec.
> > >    	 */
> > >    	tdmrs_reset_pamt_all(&tdx_tdmr_list);
> > > +	/*
> > > +	 * No more TDX private pages now, and PAMTs/TDMRs are
> > > +	 * going to be freed.  Make this globally visible so
> > > +	 * tdx_reset_memory() can read stable TDMRs/PAMTs.
> > > +	 *
> > > +	 * Note atomic_dec_return(), which is an atomic RMW with
> > > +	 * return value, always enforces the memory barrier.
> > > +	 */
> > > +	atomic_dec_return(&tdx_may_has_private_mem);
> > 
> > Make a comment here which either refers to the comment at the increment 
> > site.
> 
> I guess I got your point.  Will try to make better comments.
> 
> > 
> > >    out_free_pamts:
> > >    	tdmrs_free_pamt_all(&tdx_tdmr_list);
> > >    out_free_tdmrs:
> > > @@ -1229,6 +1251,63 @@ int tdx_enable(void)
> > >    }
> > >    EXPORT_SYMBOL_GPL(tdx_enable);
> > >    
> > > +/*
> > > + * Convert TDX private pages back to normal on platforms with
> > > + * "partial write machine check" erratum.
> > > + *
> > > + * Called from machine_kexec() before booting to the new kernel.
> > > + */
> > > +void tdx_reset_memory(void)
> > > +{
> > > +	if (!platform_tdx_enabled())
> > > +		return;
> > > +
> > > +	/*
> > > +	 * Kernel read/write to TDX private memory doesn't
> > > +	 * cause machine check on hardware w/o this erratum.
> > > +	 */
> > > +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> > > +		return;
> > > +
> > > +	/* Called from kexec() when only rebooting cpu is alive */
> > > +	WARN_ON_ONCE(num_online_cpus() != 1);
> > > +
> > > +	if (!atomic_read(&tdx_may_has_private_mem))
> > > +		return;
> > 
> > I think a comment is warranted here explicitly calling our the ordering 
> > requirement/guarantees. Actually this is a non-rmw operation so it 
> > doesn't have any bearing on the ordering/implicit mb's achieved at the 
> > "increment" site.
> 
> We don't need explicit ordering/barrier here, if I am not missing something. 
> The atomic_{inc/dec}_return() already made sure the memory ordering -- which
> guarantees when @tdx_may_has_private_mem reads true _here_, the TDMRs/PAMTs must
> be stable.
> 
> Quoted from Documentation/atomic_t.txt:
> 
> "
>  - RMW operations that have a return value are fully ordered;   
> 
>  ...
> 
> Fully ordered primitives are ordered against everything prior and everything   
> subsequent. Therefore a fully ordered primitive is like having an smp_mb()     
> before and an smp_mb() after the primitive.
> "
> 
> 
> Am I missing anything? 

OK, I guess I figured it out by myself after more thinking.  Although the
atomic_{inc|dec}_return() code path guarantees that when
@tdx_may_has_private_mem is true the TDMRs/PAMTs are stable, here in the
reading path the code below

	tdmrs_reset_pamt_all(&tdx_tdmr_list);

may still be executed speculatively before the if () statement completes

	if (!atomic_read(&tdx_may_has_private_mem))
		return;

So we need a CPU memory barrier instead of a compiler barrier.

I'll change it to use an "RMW with return value" to provide the CPU barrier.

Thanks for the feedback!

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-06-28 14:10   ` Peter Zijlstra
@ 2023-06-29  9:15     ` Huang, Kai
  2023-06-30  9:34       ` Peter Zijlstra
  0 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  9:15 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Wed, 2023-06-28 at 16:10 +0200, Peter Zijlstra wrote:
> On Tue, Jun 27, 2023 at 02:12:38AM +1200, Kai Huang wrote:
> > +static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
> > +			   struct cmr_info *cmr_array)
> > +{
> > +	struct tdx_module_output out;
> > +	u64 sysinfo_pa, cmr_array_pa;
> > +	int ret;
> > +
> > +	sysinfo_pa = __pa(sysinfo);
> > +	cmr_array_pa = __pa(cmr_array);
> > +	ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
> > +			cmr_array_pa, MAX_CMRS, NULL, &out);
> > +	if (ret)
> > +		return ret;
> > +
> > +	pr_info("TDX module: attributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
> > +		sysinfo->attributes,	sysinfo->vendor_id,
> > +		sysinfo->major_version, sysinfo->minor_version,
> > +		sysinfo->build_date,	sysinfo->build_num);
> > +
> > +	/* R9 contains the actual entries written to the CMR array. */
> 
> So I'm vexed by this comment; it's either not enough or too much.
> 
> I mean, as given you assume we all know about the magic parameters to
> TDH_SYS_INFO but then somehow need an explanation for how %r9 is changed
> from the array size to the number of used entries.
> 
> Either describe the whole thing or none of it.
> 
> Me, I would prefer all of it, because I've no idea where to begin
> looking for any of this, 
> 

Sure.  How about below?

+       /*
+        * TDH.SYS.INFO writes the TDSYSINFO_STRUCT and the CMR array
+        * to the buffers provided by the kernel (via RCX and R8
+        * respectively).  The buffer size of the TDSYSINFO_STRUCT
+        * (via RDX) and the maximum entries of the CMR array (via R9)
+        * passed to this SEAMCALL must be at least the size of
+        * TDSYSINFO_STRUCT and MAX_CMRS respectively.
+        *
+        * Upon a successful return, R9 contains the actual entries
+        * written to the CMR array.
+        */
        sysinfo_pa = __pa(sysinfo);
        cmr_array_pa = __pa(cmr_array);
        ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
@@ -228,7 +239,6 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
                sysinfo->major_version, sysinfo->minor_version,
                sysinfo->build_date,    sysinfo->build_num);
 
-       /* R9 contains the actual entries written to the CMR array. */
        print_cmrs(cmr_array, out.r9);

Or should I just repeat the spec like below?

+       /*
+        * TDH.SYS.INFO writes the TDSYSINFO_STRUCT and the CMR array
+        * to the buffers provided by the kernel:
+        *
+        * Input:
+        *  - RCX: The buffer of TDSYSINFO_STRUCT
+        *  - RDX: The size of the TDSYSINFO_STRUCT buffer, must be at
+        *         at least the size of TDSYSINFO_STRUCT
+        *  - R8: The buffer of the CMR array
+        *  - R9: The entry number of the array, must be at least
+        *        MAX_CMRS.
+        *
+        * Output (successful):
+        *  - RDX: The actual bytes written to the TDSYSINFO_STRUCT
+        *         buffer
+        *  - R9: The actual entries written to the CMR array.
+        */
        sysinfo_pa = __pa(sysinfo);
        cmr_array_pa = __pa(cmr_array);
        ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
@@ -228,7 +245,6 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
                sysinfo->major_version, sysinfo->minor_version,
                sysinfo->build_date,    sysinfo->build_num);
 
-       /* R9 contains the actual entries written to the CMR array. */
        print_cmrs(cmr_array, out.r9);

> SDM doesn't seem to be the place. That doesn't
> even list TDCALL/SEAMCALL in Volume 2 :-( Let alone describe the magic
> values.
> 

TDX has its own specs here:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html

For this one, you can find it here:

https://cdrdv2.intel.com/v1/dl/getContent/733568



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum
  2023-06-29  5:38       ` Huang, Kai
@ 2023-06-29  9:45         ` Huang, Kai
  2023-06-29  9:48           ` Nikolay Borisov
  0 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-29  9:45 UTC (permalink / raw)
  To: kvm, nik.borisov, linux-kernel
  Cc: Huang, Ying, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck,
	Tony, ak, Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, Yamahata, Isaku, tglx, hpa, peterz,
	Shahar, Sagi, imammedo, linux-mm, bp, Brown, Len, Gao, Chao,
	sathyanarayanan.kuppuswamy, Williams, Dan J, x86

On Thu, 2023-06-29 at 05:38 +0000, Huang, Kai wrote:
> On Thu, 2023-06-29 at 03:19 +0000, Huang, Kai wrote:
> > On Wed, 2023-06-28 at 12:20 +0300, Nikolay Borisov wrote:
> > > > +	atomic_inc_return(&tdx_may_has_private_mem);
> > > > +
> > > >    	/* Config the key of global KeyID on all packages */
> > > >    	ret = config_global_keyid();
> > > >    	if (ret)
> > > > @@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
> > > >    	 * as suggested by the TDX spec.
> > > >    	 */
> > > >    	tdmrs_reset_pamt_all(&tdx_tdmr_list);
> > > > +	/*
> > > > +	 * No more TDX private pages now, and PAMTs/TDMRs are
> > > > +	 * going to be freed.  Make this globally visible so
> > > > +	 * tdx_reset_memory() can read stable TDMRs/PAMTs.
> > > > +	 *
> > > > +	 * Note atomic_dec_return(), which is an atomic RMW with
> > > > +	 * return value, always enforces the memory barrier.
> > > > +	 */
> > > > +	atomic_dec_return(&tdx_may_has_private_mem);
> > > 
> > > Make a comment here which either refers to the comment at the increment 
> > > site.
> > 
> > I guess I got your point.  Will try to make better comments.
> > 
> > > 
> > > >    out_free_pamts:
> > > >    	tdmrs_free_pamt_all(&tdx_tdmr_list);
> > > >    out_free_tdmrs:
> > > > @@ -1229,6 +1251,63 @@ int tdx_enable(void)
> > > >    }
> > > >    EXPORT_SYMBOL_GPL(tdx_enable);
> > > >    
> > > > +/*
> > > > + * Convert TDX private pages back to normal on platforms with
> > > > + * "partial write machine check" erratum.
> > > > + *
> > > > + * Called from machine_kexec() before booting to the new kernel.
> > > > + */
> > > > +void tdx_reset_memory(void)
> > > > +{
> > > > +	if (!platform_tdx_enabled())
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * Kernel read/write to TDX private memory doesn't
> > > > +	 * cause machine check on hardware w/o this erratum.
> > > > +	 */
> > > > +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> > > > +		return;
> > > > +
> > > > +	/* Called from kexec() when only rebooting cpu is alive */
> > > > +	WARN_ON_ONCE(num_online_cpus() != 1);
> > > > +
> > > > +	if (!atomic_read(&tdx_may_has_private_mem))
> > > > +		return;
> > > 
> > > I think a comment is warranted here explicitly calling our the ordering 
> > > requirement/guarantees. Actually this is a non-rmw operation so it 
> > > doesn't have any bearing on the ordering/implicit mb's achieved at the 
> > > "increment" site.
> > 
> > We don't need explicit ordering/barrier here, if I am not missing something. 
> > The atomic_{inc/dec}_return() already made sure the memory ordering -- which
> > guarantees when @tdx_may_has_private_mem reads true _here_, the TDMRs/PAMTs must
> > be stable.
> > 
> > Quoted from Documentation/atomic_t.txt:
> > 
> > "
> >  - RMW operations that have a return value are fully ordered;   
> > 
> >  ...
> > 
> > Fully ordered primitives are ordered against everything prior and everything   
> > subsequent. Therefore a fully ordered primitive is like having an smp_mb()     
> > before and an smp_mb() after the primitive.
> > "
> > 
> > 
> > Am I missing anything? 
> 
> OK I guess I figured out by myself after more thinking.  Although the
> atomic_{inc|dec}_return() code path has guaranteed when @tdx_may_has_private_mem
> is true, TDMRs/PAMTs are stable, but here in the reading path, the code below
> 
> 	tdmrs_reset_pamt_all(&tdx_tdmr_list);
> 
> may still be executed speculatively before the if () statement completes
> 
> 	if (!atomic_read(&tdx_may_has_private_mem))
> 		return;
> 
> So we need CPU memory barrier instead of compiler barrier.
> 

(Sorry for multiple replies)

Hmm.. reading the SDM more carefully, the speculative execution shouldn't
matter.  It may cause instructions/data to be fetched into the cache, etc.,
but the instruction shouldn't take effect unless the above branch prediction
truly turns out to be the right result.

What matters is the order of memory reads/writes.  On x86, per the SDM, on a
single processor (which is the case here) reads/writes are basically not
reordered:

"
In a single-processor system for memory regions defined as write-back cacheable,
the memory-ordering model respects the following principles ...:
• Reads are not reordered with other reads.
• Writes are not reordered with older reads.
• Writes to memory are not reordered with other writes, with the following
  exceptions:
  (string operations/non-temporal moves)
...
"

So in practice there should be no problem.  But I will just use the correct
atomic_t operations to force a memory barrier here.
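
E.g., something like the below on the reader side (a sketch; a value-returning
RMW is used purely for its full-barrier semantics):

	/*
	 * Pairs with the fully ordered atomic_inc_return()/
	 * atomic_dec_return() in init_tdx_module().  A value-returning RMW
	 * such as atomic_add_return(0, ...) is fully ordered, so the reads
	 * of TDMRs/PAMTs below cannot be ordered before this check.
	 */
	if (!atomic_add_return(0, &tdx_may_has_private_mem))
		return;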

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum
  2023-06-29  9:45         ` Huang, Kai
@ 2023-06-29  9:48           ` Nikolay Borisov
  0 siblings, 0 replies; 159+ messages in thread
From: Nikolay Borisov @ 2023-06-29  9:48 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Huang, Ying, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck,
	Tony, ak, Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, Yamahata, Isaku, tglx, hpa, peterz,
	Shahar, Sagi, imammedo, linux-mm, bp, Brown, Len, Gao, Chao,
	sathyanarayanan.kuppuswamy, Williams, Dan J, x86



On 29.06.23 г. 12:45 ч., Huang, Kai wrote:
> On Thu, 2023-06-29 at 05:38 +0000, Huang, Kai wrote:
>> On Thu, 2023-06-29 at 03:19 +0000, Huang, Kai wrote:
>>> On Wed, 2023-06-28 at 12:20 +0300, Nikolay Borisov wrote:
>>>>> +	atomic_inc_return(&tdx_may_has_private_mem);
>>>>> +
>>>>>     	/* Config the key of global KeyID on all packages */
>>>>>     	ret = config_global_keyid();
>>>>>     	if (ret)
>>>>> @@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
>>>>>     	 * as suggested by the TDX spec.
>>>>>     	 */
>>>>>     	tdmrs_reset_pamt_all(&tdx_tdmr_list);
>>>>> +	/*
>>>>> +	 * No more TDX private pages now, and PAMTs/TDMRs are
>>>>> +	 * going to be freed.  Make this globally visible so
>>>>> +	 * tdx_reset_memory() can read stable TDMRs/PAMTs.
>>>>> +	 *
>>>>> +	 * Note atomic_dec_return(), which is an atomic RMW with
>>>>> +	 * return value, always enforces the memory barrier.
>>>>> +	 */
>>>>> +	atomic_dec_return(&tdx_may_has_private_mem);
>>>>
>>>> Make a comment here which either refers to the comment at the increment
>>>> site.
>>>
>>> I guess I got your point.  Will try to make better comments.
>>>
>>>>
>>>>>     out_free_pamts:
>>>>>     	tdmrs_free_pamt_all(&tdx_tdmr_list);
>>>>>     out_free_tdmrs:
>>>>> @@ -1229,6 +1251,63 @@ int tdx_enable(void)
>>>>>     }
>>>>>     EXPORT_SYMBOL_GPL(tdx_enable);
>>>>>     
>>>>> +/*
>>>>> + * Convert TDX private pages back to normal on platforms with
>>>>> + * "partial write machine check" erratum.
>>>>> + *
>>>>> + * Called from machine_kexec() before booting to the new kernel.
>>>>> + */
>>>>> +void tdx_reset_memory(void)
>>>>> +{
>>>>> +	if (!platform_tdx_enabled())
>>>>> +		return;
>>>>> +
>>>>> +	/*
>>>>> +	 * Kernel read/write to TDX private memory doesn't
>>>>> +	 * cause machine check on hardware w/o this erratum.
>>>>> +	 */
>>>>> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
>>>>> +		return;
>>>>> +
>>>>> +	/* Called from kexec() when only rebooting cpu is alive */
>>>>> +	WARN_ON_ONCE(num_online_cpus() != 1);
>>>>> +
>>>>> +	if (!atomic_read(&tdx_may_has_private_mem))
>>>>> +		return;
>>>>
>>>> I think a comment is warranted here explicitly calling our the ordering
>>>> requirement/guarantees. Actually this is a non-rmw operation so it
>>>> doesn't have any bearing on the ordering/implicit mb's achieved at the
>>>> "increment" site.
>>>
>>> We don't need explicit ordering/barrier here, if I am not missing something.
>>> The atomic_{inc/dec}_return() already made sure the memory ordering -- which
>>> guarantees when @tdx_may_has_private_mem reads true _here_, the TDMRs/PAMTs must
>>> be stable.
>>>
>>> Quoted from Documentation/atomic_t.txt:
>>>
>>> "
>>>   - RMW operations that have a return value are fully ordered;
>>>
>>>   ...
>>>
>>> Fully ordered primitives are ordered against everything prior and everything
>>> subsequent. Therefore a fully ordered primitive is like having an smp_mb()
>>> before and an smp_mb() after the primitive.
>>> "
>>>
>>>
>>> Am I missing anything?
>>
>> OK I guess I figured out by myself after more thinking.  Although the
>> atomic_{inc|dec}_return() code path has guaranteed when @tdx_may_has_private_mem
>> is true, TDMRs/PAMTs are stable, but here in the reading path, the code below
>>
>> 	tdmrs_reset_pamt_all(&tdx_tdmr_list);
>>
>> may still be executed speculatively before the if () statement completes
>>
>> 	if (!atomic_read(&tdx_may_has_private_mem))
>> 		return;
>>
>> So we need CPU memory barrier instead of compiler barrier.
>>
> 
> (Sorry for multiple replies)
> 
> Hmm.. reading the SDM more carefully, the speculative execution shouldn't
> matter.  It may cause instruction/data being fetched to the cache, etc, but the
> instruction shouldn't take effort unless the above branch predication truly
> turns out to be the right result.
> 
> What matters is memory reads/writes order.  On x86, per SDM on single processor
> (which is the case here) basically reads/writes are not reordered:
> 
> "
> In a single-processor system for memory regions defined as write-back cacheable,
> the memory-ordering model respects the following principles ...:
> • Reads are not reordered with other reads.
> • Writes are not reordered with older reads.
> • Writes to memory are not reordered with other writes, with the following
>    exceptions:
>    (string operations/non-temporal moves)
> ...
> "
> 
> So in practice there should be no problem.  But I will just use the correct
> atomic_t operations to force a memory barrier here.


As per memory-barriers.txt, there needs to be a proper comment explaining a
particular ordering scenario.  Let's not forget that this code will have to be
maintained for years to come, not necessarily by you, and the poor sod that
comes after you should be provided all the help, in terms of context, to
understand the code :)

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-28 15:29   ` Peter Zijlstra
  2023-06-28 20:38     ` Peter Zijlstra
@ 2023-06-29 10:00     ` Huang, Kai
  1 sibling, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-29 10:00 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Wed, 2023-06-28 at 17:29 +0200, Peter Zijlstra wrote:
> On Tue, Jun 27, 2023 at 02:12:50AM +1200, Kai Huang wrote:
> > diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> > index 49a54356ae99..757b0c34be10 100644
> > --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> > +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> > @@ -1,6 +1,7 @@
> >  /* SPDX-License-Identifier: GPL-2.0 */
> >  #include <asm/asm-offsets.h>
> >  #include <asm/tdx.h>
> > +#include <asm/asm.h>
> >  
> >  /*
> >   * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
> > @@ -45,6 +46,7 @@
> >  	/* Leave input param 2 in RDX */
> >  
> >  	.if \host
> > +1:
> >  	seamcall
> 
> So what registers are actually clobbered by SEAMCALL ? There's a
> distinct lack of it in SDM Vol.2 instruction list :-(
> 
> >  	/*
> >  	 * SEAMCALL instruction is essentially a VMExit from VMX root
> > @@ -57,10 +59,23 @@
> >  	 * This value will never be used as actual SEAMCALL error code as
> >  	 * it is from the Reserved status code class.
> >  	 */
> > -	jnc .Lno_vmfailinvalid
> > +	jnc .Lseamcall_out
> >  	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
> > -.Lno_vmfailinvalid:
> > +	jmp .Lseamcall_out
> > +2:
> > +	/*
> > +	 * SEAMCALL caused #GP or #UD.  By reaching here %eax contains
> > +	 * the trap number.  Convert the trap number to the TDX error
> > +	 * code by setting TDX_SW_ERROR to the high 32-bits of %rax.
> > +	 *
> > +	 * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction
> > +	 * only accepts 32-bit immediate at most.
> > +	 */
> > +	mov $TDX_SW_ERROR, %r12
> > +	orq %r12, %rax
> >  
> > +	_ASM_EXTABLE_FAULT(1b, 2b)
> > +.Lseamcall_out:
> 
> This is all pretty atrocious code flow... would it at all be possible to
> write it like:
> 
> SYM_FUNC_START(...)
> 
> .if \host
> 1:	seamcall
> 	cmovc	%spare, %rax

Looks like using cmovc can remove the need for one additional label.

I guess it can be done in a separate patch.

> 2:
> .else
> 	tdcall
> .endif
> 
> 	.....
> 	RET
> 
> 
> 3:
> 	mov $TDX_SW_ERROR, %r12
> 	orq %r12, %rax
> 	jmp 2b
> 
> 	_ASM_EXTABLE_FAULT(1b, 3b)
> 
> SYM_FUNC_END()
> 
> That is, having all that inline in the hotpath is quite horrific.
> 

__tdx_module_call() and __seamcall() both have a "RET" right after this
macro:

SYM_FUNC_START(__tdx_module_call)
        FRAME_BEGIN
        TDX_MODULE_CALL host=0
        FRAME_END
        RET
SYM_FUNC_END(__tdx_module_call)

In this case, I think we need to remove the "RET" there.

I hadn't planned on modifying __tdx_module_call() and __seamcall() in this
patch.


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-28 13:54     ` Peter Zijlstra
  2023-06-28 23:25       ` Huang, Kai
@ 2023-06-29 10:15       ` kirill.shutemov
  1 sibling, 0 replies; 159+ messages in thread
From: kirill.shutemov @ 2023-06-29 10:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kai Huang, linux-kernel, kvm, linux-mm, x86, dave.hansen,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Wed, Jun 28, 2023 at 03:54:36PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 28, 2023 at 02:58:13PM +0200, Peter Zijlstra wrote:
> 
> > Can someone explain to me why __tdx_hypercall() is sane (per the above)
> > but then we grew __tdx_module_call() as an absolute abomination and are
> > apparently using that for seam too?
> 
> That is, why do we have two different TDCALL wrappers? Makes no sense.

__tdx_module_call() is the wrapper for TDCALL.

__tdx_hypercall() is the wrapper for the TDG.VP.VMCALL leaf function of
TDCALL.  That function is used often, and it uses a wider range of
registers compared to the rest of the TDCALL leaf functions.
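
For reference, the two shapes are roughly (from memory; see
arch/x86/include/asm/tdx.h for the authoritative prototypes):

	/* Generic TDCALL leaf: c,d,8,9 in; c,d,8,9,10,11 out */
	u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
			      struct tdx_module_output *out);

	/* TDG.VP.VMCALL: uses the wider r10-r15 register set */
	u64 __tdx_hypercall(struct tdx_hypercall_args *args);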

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-28 20:38     ` Peter Zijlstra
  2023-06-28 21:11       ` Peter Zijlstra
@ 2023-06-29 10:33       ` Huang, Kai
  2023-06-30 10:06         ` Peter Zijlstra
  2023-06-29 11:16       ` kirill.shutemov
  2 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-29 10:33 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Wed, 2023-06-28 at 22:38 +0200, Peter Zijlstra wrote:
> On Wed, Jun 28, 2023 at 05:29:01PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 27, 2023 at 02:12:50AM +1200, Kai Huang wrote:
> > > diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> > > index 49a54356ae99..757b0c34be10 100644
> > > --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> > > +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> > > @@ -1,6 +1,7 @@
> > >  /* SPDX-License-Identifier: GPL-2.0 */
> > >  #include <asm/asm-offsets.h>
> > >  #include <asm/tdx.h>
> > > +#include <asm/asm.h>
> > >  
> > >  /*
> > >   * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
> > > @@ -45,6 +46,7 @@
> > >  	/* Leave input param 2 in RDX */
> > >  
> > >  	.if \host
> > > +1:
> > >  	seamcall
> > 
> > So what registers are actually clobbered by SEAMCALL ? There's a
> > distinct lack of it in SDM Vol.2 instruction list :-(
> 
> With the exception of the abomination that is TDH.VP.ENTER all SEAMCALLs
> seem to be limited to the set presented here (c,d,8,9,10,11) and all
> other registers should be available.

RAX is also used as SEAMCALL return code.

Looking at the later versions of TDX spec (with TD live migration, etc), it
seems they are already using R12-R13 as SEAMCALL output:

https://cdrdv2.intel.com/v1/dl/getContent/733579

E.g., 6.3.15. NEW: TDH.IMPORT.MEM Leaf

It uses R12 and R13 as input.

> 
> Can we please make that a hard requirement, SEAMCALL must not use
> registers outside this? We can hardly program to random future
> extensions; we need hard ABI guarantees here.


I believe all other GPRs are just saved/restored in SEAMCALL/SEAMRET, so in
practice all other GPRs not used as input/output should not be clobbered.  But I
will confirm with the TDX module guys.  And even if it's true in practice, it's better
to document it.  

But I think we also want to ask them to stop adding more registers as
input/output.

I'll talk to TDX module team on this.

> 
> That also means we should be able to use si,di for the cmovc below.
> 
> Kirill, back when we did __tdx_hypercall() we got bp removed as a valid
> register, the 1.0 spec still lists that, and it is also listed in
> TDH.VP.ENTER, I'm assuming it will be removed there too?
> 
> bp must not be used -- it violates the pre-existing calling convention.
> 
> 


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-28 20:38     ` Peter Zijlstra
  2023-06-28 21:11       ` Peter Zijlstra
  2023-06-29 10:33       ` Huang, Kai
@ 2023-06-29 11:16       ` kirill.shutemov
  2 siblings, 0 replies; 159+ messages in thread
From: kirill.shutemov @ 2023-06-29 11:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kai Huang, linux-kernel, kvm, linux-mm, x86, dave.hansen,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Wed, Jun 28, 2023 at 10:38:23PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 28, 2023 at 05:29:01PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 27, 2023 at 02:12:50AM +1200, Kai Huang wrote:
> > > diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> > > index 49a54356ae99..757b0c34be10 100644
> > > --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> > > +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> > > @@ -1,6 +1,7 @@
> > >  /* SPDX-License-Identifier: GPL-2.0 */
> > >  #include <asm/asm-offsets.h>
> > >  #include <asm/tdx.h>
> > > +#include <asm/asm.h>
> > >  
> > >  /*
> > >   * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
> > > @@ -45,6 +46,7 @@
> > >  	/* Leave input param 2 in RDX */
> > >  
> > >  	.if \host
> > > +1:
> > >  	seamcall
> > 
> > So what registers are actually clobbered by SEAMCALL ? There's a
> > distinct lack of it in SDM Vol.2 instruction list :-(
> 
> With the exception of the abomination that is TDH.VP.ENTER all SEAMCALLs
> seem to be limited to the set presented here (c,d,8,9,10,11) and all
> other registers should be available.
> 
> Can we please make that a hard requirement, SEAMCALL must not use
> registers outside this? We can hardly program to random future
> extensions; we need hard ABI guarantees here.
> 
> That also means we should be able to use si,di for the cmovc below.
> 
> Kirill, back when we did __tdx_hypercall() we got bp removed as a valid
> register, the 1.0 spec still lists that, and it is also listed in
> TDH.VP.ENTER, I'm assuming it will be removed there too?
> 
> bp must not be used -- it violates the pre-existing calling convention.

I've just brought it up again internally. Let's see what will happen.
 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 04/22] x86/cpu: Detect TDX partial write machine check erratum
  2023-06-26 14:12 ` [PATCH v12 04/22] x86/cpu: Detect TDX partial write machine check erratum Kai Huang
@ 2023-06-29 11:22   ` David Hildenbrand
  0 siblings, 0 replies; 159+ messages in thread
From: David Hildenbrand @ 2023-06-29 11:22 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo

On 26.06.23 16:12, Kai Huang wrote:
> TDX memory has integrity and confidentiality protections.  Violations of
> this integrity protection are supposed to only affect TDX operations and
> are never supposed to affect the host kernel itself.  In other words,
> the host kernel should never, itself, see machine checks induced by the
> TDX integrity hardware.
> 
> Alas, the first few generations of TDX hardware have an erratum.  A
> partial write to a TDX private memory cacheline will silently "poison"
> the line.  Subsequent reads will consume the poison and generate a
> machine check.  According to the TDX hardware spec, neither of these
> things should have happened.
> 
> Virtually all kernel memory access operations happen in full
> cachelines.  In practice, writing a "byte" of memory usually reads a 64
> byte cacheline of memory, modifies it, then writes the whole line back.
> Those operations do not trigger this problem.
> 
> This problem is triggered by "partial" writes where a write transaction
> of less than cacheline lands at the memory controller.  The CPU does
> these via non-temporal write instructions (like MOVNTI), or through
> UC/WC memory mappings.  The issue can also be triggered away from the
> CPU by devices doing partial writes via DMA.
> 
> With this erratum, there are additional things that need to be done.  Similar
> to other CPU bugs, use a CPU bug bit to indicate this erratum, and
> detect this erratum during early boot.  Note this bug reflects the
> hardware thus it is detected regardless of whether the kernel is built
> with TDX support or not.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure
  2023-06-28  3:34     ` Huang, Kai
  2023-06-28 11:50       ` kirill.shutemov
@ 2023-06-29 11:25       ` David Hildenbrand
  1 sibling, 0 replies; 159+ messages in thread
From: David Hildenbrand @ 2023-06-29 11:25 UTC (permalink / raw)
  To: Huang, Kai, Gao, Chao
  Cc: kvm, Raj, Ashok, Hansen, Dave, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, peterz, Shahar,
	Sagi, imammedo, bp, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

>> then the code becomes self-explanatory. i.e., you can drop the comment.
> 
> If using this, I ended up with the following:
> 
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -23,6 +23,8 @@
>   #define TDX_SEAMCALL_GP                        (TDX_SW_ERROR | X86_TRAP_GP)
>   #define TDX_SEAMCALL_UD                        (TDX_SW_ERROR | X86_TRAP_UD)
>   
> +#define TDX_SUCCESS           0
> +
> 
> Hi Kirill/Dave/David,
> 
> Are you happy with this?

Yes, all sounds good to me!
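
FWIW, the call sites then read naturally, e.g. (illustrative sketch,
not the exact code):

	u64 sret = __seamcall(fn, rcx, rdx, r8, r9, out);

	if (sret == TDX_SUCCESS)
		return 0;

	/* otherwise translate sret into an -Exxx error, log the leaf, ... */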

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-26 14:12 ` [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand Kai Huang
                     ` (4 preceding siblings ...)
  2023-06-28 13:17   ` Peter Zijlstra
@ 2023-06-29 11:31   ` David Hildenbrand
  2023-06-29 22:58     ` Huang, Kai
  5 siblings, 1 reply; 159+ messages in thread
From: David Hildenbrand @ 2023-06-29 11:31 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo

On 26.06.23 16:12, Kai Huang wrote:
> To enable TDX the kernel needs to initialize TDX from two perspectives:
> 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready
> to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL
> on one logical cpu before the kernel wants to make any other SEAMCALLs
> on that cpu (including those involved during module initialization and
> running TDX guests).
> 
> The TDX module can be initialized only once in its lifetime.  Instead
> of always initializing it at boot time, this implementation chooses an
> "on demand" approach to initialize TDX until there is a real need (e.g
> when requested by KVM).  This approach has below pros:
> 
> 1) It avoids consuming the memory that must be allocated by kernel and
> given to the TDX module as metadata (~1/256th of the TDX-usable memory),
> and also saves the CPU cycles of initializing the TDX module (and the
> metadata) when TDX is not used at all.
> 
> 2) The TDX module design allows it to be updated while the system is
> running.  The update procedure shares quite a few steps with this "on
> demand" initialization mechanism.  The hope is that much of "on demand"
> mechanism can be shared with a future "update" mechanism.  A boot-time
> TDX module implementation would not be able to share much code with the
> update mechanism.
> 
> 3) Making SEAMCALL requires VMX to be enabled.  Currently, only the KVM
> code mucks with VMX enabling.  If the TDX module were to be initialized
> separately from KVM (like at boot), the boot code would need to be
> taught how to muck with VMX enabling and KVM would need to be taught how
> to cope with that.  Making KVM itself responsible for TDX initialization
> lets the rest of the kernel stay blissfully unaware of VMX.
> 
> Similar to module initialization, also make the per-cpu initialization
> "on demand" as it also depends on VMX being enabled.
> 
> Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX
> module and enable TDX on local cpu respectively.  For now tdx_enable()
> is a placeholder.  The TODO list will be pared down as functionality is
> added.
> 
> Export both tdx_cpu_enable() and tdx_enable() for KVM use.
> 
> In tdx_enable() use a state machine protected by mutex to make sure the
> initialization will only be done once, as tdx_enable() can be called
> multiple times (i.e. KVM module can be reloaded) and may be called
> concurrently by other kernel components in the future.
> 
> The per-cpu initialization on each cpu can only be done once during the
> module's life time.  Use a per-cpu variable to track its status to make
> sure it is only done once in tdx_cpu_enable().
> 
> Also, a SEAMCALL to do TDX module global initialization must be done
> once on any logical cpu before any per-cpu initialization SEAMCALL.  Do
> it inside tdx_cpu_enable() too (if it hasn't been done).
> 
> tdx_enable() can potentially invoke SEAMCALLs on any online cpus.  The
> per-cpu initialization must be done before those SEAMCALLs are invoked
> on some cpu.  To keep things simple, in tdx_cpu_enable(), always do the
> per-cpu initialization regardless of whether the TDX module has been
> initialized or not.  And in tdx_enable(), don't call tdx_cpu_enable()
> but assume the caller has disabled CPU hotplug, done VMXON and
> tdx_cpu_enable() on all online cpus before calling tdx_enable().
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
> 
> v11 -> v12:
>   - Simplified TDX module global init and lp init status tracking (David).
>   - Added comment around try_init_module_global() for using
>     raw_spin_lock() (Dave).
>   - Added one sentence to changelog to explain why to expose tdx_enable()
>     and tdx_cpu_enable() (Dave).
>   - Simplifed comments around tdx_enable() and tdx_cpu_enable() to use
>     lockdep_assert_*() instead. (Dave)
>   - Removed redundent "TDX" in error message (Dave).
> 
> v10 -> v11:
>   - Return -ENODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off.
>   - Return the actual error code for tdx_enable() instead of -EINVAL.
>   - Added Isaku's Reviewed-by.
> 
> v9 -> v10:
>   - Merged the patch to handle per-cpu initialization to this patch to
>     tell the story better.
>   - Changed how to handle the per-cpu initialization to only provide a
>     tdx_cpu_enable() function to let the user of TDX to do it when the
>     user wants to run TDX code on a certain cpu.
>   - Changed tdx_enable() to not call cpus_read_lock() explicitly, but
>     call lockdep_assert_cpus_held() to assume the caller has done that.
>   - Improved comments around tdx_enable() and tdx_cpu_enable().
>   - Improved changelog to tell the story better accordingly.
> 
> v8 -> v9:
>   - Removed detailed TODO list in the changelog (Dave).
>   - Added back steps to do module global initialization and per-cpu
>     initialization in the TODO list comment.
>   - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h
> 
> v7 -> v8:
>   - Refined changelog (Dave).
>   - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
>   - Add a "TODO list" comment in init_tdx_module() to list all steps of
>     initializing the TDX Module to tell the story (Dave).
>   - Made tdx_enable() universally return -EINVAL, and removed nonsense
>     comments (Dave).
>   - Simplified __tdx_enable() to only handle success or failure.
>   - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR
>   - Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
>   - Improved comments (Dave).
>   - Pointed out 'tdx_module_status' is software thing (Dave).
> 
> v6 -> v7:
>   - No change.
> 
> v5 -> v6:
>   - Added code to set status to TDX_MODULE_NONE if TDX module is not
>     loaded (Chao)
>   - Added Chao's Reviewed-by.
>   - Improved comments around cpus_read_lock().
> 
> - v3->v5 (no feedback on v4):
>   - Removed the check that SEAMRR and TDX KeyID have been detected on
>     all present cpus.
>   - Removed tdx_detect().
>   - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
>     hotplug lock and return early with error message.
>   - Improved dmesg printing for TDX module detection and initialization.
> 
> 
> ---
>   arch/x86/include/asm/tdx.h  |   4 +
>   arch/x86/virt/vmx/tdx/tdx.c | 162 ++++++++++++++++++++++++++++++++++++
>   arch/x86/virt/vmx/tdx/tdx.h |  13 +++
>   3 files changed, 179 insertions(+)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 4dfe2e794411..d8226a50c58c 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -97,8 +97,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>   
>   #ifdef CONFIG_INTEL_TDX_HOST
>   bool platform_tdx_enabled(void);
> +int tdx_cpu_enable(void);
> +int tdx_enable(void);
>   #else	/* !CONFIG_INTEL_TDX_HOST */
>   static inline bool platform_tdx_enabled(void) { return false; }
> +static inline int tdx_cpu_enable(void) { return -ENODEV; }
> +static inline int tdx_enable(void)  { return -ENODEV; }
>   #endif	/* CONFIG_INTEL_TDX_HOST */
>   
>   #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 141d12376c4d..29ca18f66d61 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -13,6 +13,10 @@
>   #include <linux/errno.h>
>   #include <linux/printk.h>
>   #include <linux/smp.h>
> +#include <linux/cpu.h>
> +#include <linux/spinlock.h>
> +#include <linux/percpu-defs.h>
> +#include <linux/mutex.h>
>   #include <asm/msr-index.h>
>   #include <asm/msr.h>
>   #include <asm/archrandom.h>
> @@ -23,6 +27,13 @@ static u32 tdx_global_keyid __ro_after_init;
>   static u32 tdx_guest_keyid_start __ro_after_init;
>   static u32 tdx_nr_guest_keyids __ro_after_init;
>   
> +static bool tdx_global_initialized;
> +static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);
> +static DEFINE_PER_CPU(bool, tdx_lp_initialized);
> +
> +static enum tdx_module_status_t tdx_module_status;

Why can't you switch to a simple bool here as well?

It's either initialized or uninitialized. If uninitialized and you get 
an error, leave it uninitialized. The next caller will try again and 
fail again.
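
I.e., something like this (untested sketch):

	static DEFINE_MUTEX(tdx_module_lock);
	static bool tdx_module_initialized;

	int tdx_enable(void)
	{
		int ret = 0;

		mutex_lock(&tdx_module_lock);
		/* On failure the flag stays false and the next caller retries. */
		if (!tdx_module_initialized) {
			ret = init_tdx_module();
			if (!ret)
				tdx_module_initialized = true;
		}
		mutex_unlock(&tdx_module_lock);

		return ret;
	}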

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-29 11:31   ` David Hildenbrand
@ 2023-06-29 22:58     ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-29 22:58 UTC (permalink / raw)
  To: kvm, linux-kernel, david
  Cc: Raj, Ashok, Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki,
	Rafael J, kirill.shutemov, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, linux-mm, tglx, Yamahata, Isaku, mingo,
	nik.borisov, hpa, peterz, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86


> > +static enum tdx_module_status_t tdx_module_status;
> 
> Why can't you switch to a simple bool here as well?
> 
> It's either initialized or uninitialized. If uninitialized and you get 
> an error, leave it uninitialized. The next caller will try again and 
> fail again.
> 

We can, but in that case messages might be printed on each module
initialization call.  Let's say TDH.SYS.INFO is successful but the later
TDH.SYS.CONFIG fails.  Then each initialization call will print out the TDX
module info and CMR info again.

I think only allowing initialization to be attempted once would be better in
this case.  Apart from the message printing, it's OK to just use a simple
bool.
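
I.e., with the enum a failure is latched, so the info is only ever
printed once (rough sketch of the intent, not the exact code):

	mutex_lock(&tdx_module_lock);

	switch (tdx_module_status) {
	case TDX_MODULE_UNKNOWN:
		ret = __tdx_enable();	/* may print module/CMR info */
		break;
	case TDX_MODULE_INITIALIZED:
		ret = 0;		/* already done, print nothing */
		break;
	default:
		ret = -EINVAL;		/* failed before: don't retry/reprint */
		break;
	}

	mutex_unlock(&tdx_module_lock);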

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-28 21:16         ` Peter Zijlstra
@ 2023-06-30  9:03           ` kirill.shutemov
  2023-06-30 10:02             ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: kirill.shutemov @ 2023-06-30  9:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kai Huang, linux-kernel, kvm, linux-mm, x86, dave.hansen,
	tony.luck, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Wed, Jun 28, 2023 at 11:16:41PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 28, 2023 at 11:11:32PM +0200, Peter Zijlstra wrote:
> > --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> > +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> > @@ -17,37 +17,44 @@
> >   *            TDX module and hypercalls to the VMM.
> >   * SEAMCALL - used by TDX hosts to make requests to the
> >   *            TDX module.
> > + *
> > + *-------------------------------------------------------------------------
> > + * TDCALL/SEAMCALL ABI:
> > + *-------------------------------------------------------------------------
> > + * Input Registers:
> > + *
> > + * RAX                 - TDCALL Leaf number.
> > + * RCX,RDX,R8-R9       - TDCALL Leaf specific input registers.
> > + *
> > + * Output Registers:
> > + *
> > + * RAX                 - TDCALL instruction error code.
> > + * RCX,RDX,R8-R11      - TDCALL Leaf specific output registers.
> > + *
> > + *-------------------------------------------------------------------------
> > + *
> > + * __tdx_module_call() function ABI:
> > + *
> > + * @fn   (RDI)         - TDCALL Leaf ID,    moved to RAX
> > + * @regs (RSI)         - struct tdx_regs pointer
> > + *
> > + * Return status of TDCALL via RAX.
> >   */
> > +.macro TDX_MODULE_CALL host:req ret:req
> > +	FRAME_BEGIN
> >  
> > +	mov	%rdi, %rax
> > +	mov	$TDX_SEAMCALL_VMFAILINVALID, %rdi
> >  
> > +	mov	TDX_MODULE_rcx(%rsi), %rcx
> > +	mov	TDX_MODULE_rdx(%rsi), %rdx
> > +	mov	TDX_MODULE_r8(%rsi),  %r8
> > +	mov	TDX_MODULE_r9(%rsi),  %r9
> > +//	mov	TDX_MODULE_r10(%rsi), %r10
> > +//	mov	TDX_MODULE_r11(%rsi), %r11
> >  
> > +.if \host
> > +1:	seamcall
> >  	/*
> >  	 * SEAMCALL instruction is essentially a VMExit from VMX root
> >  	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
> 	...
> >  	 * This value will never be used as actual SEAMCALL error code as
> >  	 * it is from the Reserved status code class.
> >  	 */
> > +	cmovc	%rdi, %rax
> >  2:
> > +.else
> >  	tdcall
> > +.endif
> >  
> > +.if \ret
> > +	movq %rcx, TDX_MODULE_rcx(%rsi)
> > +	movq %rdx, TDX_MODULE_rdx(%rsi)
> > +	movq %r8,  TDX_MODULE_r8(%rsi)
> > +	movq %r9,  TDX_MODULE_r9(%rsi)
> > +	movq %r10, TDX_MODULE_r10(%rsi)
> > +	movq %r11, TDX_MODULE_r11(%rsi)
> > +.endif
> > +
> > +	FRAME_END
> > +	RET
> > +
> > +.if \host
> > +3:
> > +	mov	$TDX_SW_ERROR, %rdi
> > +	or	%rdi, %rax
> > +	jmp 2b
> >  
> > +	_ASM_EXTABLE_FAULT(1b, 3b)
> > +.endif
> >  .endm
> 
> Isn't that much simpler?

I'm okay either way.

Obviously, arch/x86/coco/tdx/tdcall.S has to be patched to use the new
TDX_MODULE_CALL macro.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-29  0:15               ` Huang, Kai
@ 2023-06-30  9:22                 ` Peter Zijlstra
  2023-06-30 10:09                   ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-30  9:22 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, x86, Raj, Ashok, Hansen, Dave, david, bagasdotme, ak,
	Wysocki, Rafael J, linux-kernel, Chatre, Reinette, mingo,
	pbonzini, Christopherson,,
	Sean, Yamahata, Isaku, nik.borisov, tglx, Luck, Tony,
	kirill.shutemov, hpa, imammedo, sathyanarayanan.kuppuswamy,
	linux-mm, bp, Brown, Len, Shahar, Sagi, Huang, Ying, Williams,
	Dan J, Gao, Chao

On Thu, Jun 29, 2023 at 12:15:13AM +0000, Huang, Kai wrote:

> > 	Can be called locally or through an IPI function call.
> > 
> 
> Thanks.  As in another reply, if using spinlock is OK, then I think we can say
> it will be called either locally or through an IPI function call.  Otherwise, we
> do it via a new separate function tdx_global_init(), and no lock is needed in that
> function.  The caller should call it properly.

IPI must use raw_spinlock_t. I'm ok with using raw_spinlock_t if there's
actual need for that, but the code as presented didn't -- in comments or
otherwise -- make it clear why it was as it was.

TDX not specifying time constraints on the various TD/SEAM-CALLs is
of course sad, but alas.
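
That is, if the global init can run from an IPI function call, the only
legal shape is something like (sketch; seamcall() argument list
approximate):

	static int try_init_module_global(void)
	{
		int ret = 0;

		/*
		 * Can be called locally or through an IPI function
		 * call, i.e. with IRQs disabled -- sleeping locks (a
		 * mutex, or spinlock_t on PREEMPT_RT) are forbidden
		 * here, hence the raw_spinlock_t.
		 */
		raw_spin_lock(&tdx_global_init_lock);
		if (!tdx_global_initialized) {
			ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
			if (!ret)
				tdx_global_initialized = true;
		}
		raw_spin_unlock(&tdx_global_init_lock);

		return ret;
	}

And that constraint is exactly what should be written down next to the
lock.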

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-29  0:00     ` Huang, Kai
@ 2023-06-30  9:25       ` Peter Zijlstra
  2023-06-30  9:48         ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-30  9:25 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Thu, Jun 29, 2023 at 12:00:44AM +0000, Huang, Kai wrote:

> The spec says it doesn't have a latency requirement, so theoretically it could
> be long.  SEAMCALL is a VMEXIT so it would at least cost thousands of cycles.

:-(

> If raw_spinlock isn't desired, I think I can introduce another function to do
> this and let the caller to call it before calling tdx_cpu_enable().  E.g., we
> can have below functions:
> 
> 1) tdx_global_init()	-> TDH_SYS_INIT
> 2) tdx_cpu_init()	-> TDH_SYS_LP_INIT
> 3) tdx_enable()		-> actual module initialization
> 
> How does this sound?

Ah, wait, I hadn't had enough wake-up juice, it's tdx_global_init() that
did the raw_spinlock_t, but that isn't the IPI thing.

Then perhaps just use a mutex to serialize things?
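
I.e., pull TDH.SYS.INIT out of the IPI path entirely, something like
(sketch; seamcall() argument list approximate):

	static DEFINE_MUTEX(tdx_global_init_lock);	/* no longer raw */

	/* Only ever called from regular task context, so a mutex is fine. */
	int tdx_global_init(void)
	{
		int ret = 0;

		mutex_lock(&tdx_global_init_lock);
		if (!tdx_global_initialized) {
			ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
			if (!ret)
				tdx_global_initialized = true;
		}
		mutex_unlock(&tdx_global_init_lock);

		return ret;
	}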



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-29  0:10     ` Huang, Kai
@ 2023-06-30  9:26       ` Peter Zijlstra
  2023-06-30  9:55         ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-30  9:26 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote:
> On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote:
> > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> > 
> > I can't find a single caller of this.. why is this exported?
> 
> It's for the KVM TDX patches to use, which aren't in this series.
> 
> I'll remove the export.  KVM TDX series can export it.

Fair enough; where will the KVM TDX series call this? Earlier there was
talk about doing it at kvm module load time -- but I objected (and still
do object) to that.

What's the current plan?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-06-29  9:15     ` Huang, Kai
@ 2023-06-30  9:34       ` Peter Zijlstra
  2023-06-30  9:58         ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-30  9:34 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Thu, Jun 29, 2023 at 09:15:39AM +0000, Huang, Kai wrote:

> Sure.  How about below?
> 
> +       /*
> +        * TDH.SYS.INFO writes the TDSYSINFO_STRUCT and the CMR array
> +        * to the buffers provided by the kernel (via RCX and R8
> +        * respectively).  The buffer size of the TDSYSINFO_STRUCT
> +        * (via RDX) and the maximum entries of the CMR array (via R9)
> +        * passed to this SEAMCALL must be at least the size of
> +        * TDSYSINFO_STRUCT and MAX_CMRS respectively.
> +        *
> +        * Upon a successful return, R9 contains the actual entries
> +        * written to the CMR array.
> +        */
>         sysinfo_pa = __pa(sysinfo);
>         cmr_array_pa = __pa(cmr_array);
>         ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,

> Or should I just repeat the spec like below?

> +       /*
> +        * TDH.SYS.INFO writes the TDSYSINFO_STRUCT and the CMR array
> +        * to the buffers provided by the kernel:
> +        *
> +        * Input:
> +        *  - RCX: The buffer of TDSYSINFO_STRUCT
> +        *  - RDX: The size of the TDSYSINFO_STRUCT buffer, must be at
> +        *         at least the size of TDSYSINFO_STRUCT
> +        *  - R8: The buffer of the CMR array
> +        *  - R9: The entry number of the array, must be at least
> +        *        MAX_CMRS.
> +        *
> +        * Output (successful):
> +        *  - RDX: The actual bytes written to the TDSYSINFO_STRUCT
> +        *         buffer
> +        *  - R9: The actual entries written to the CMR array.
> +        */
>         sysinfo_pa = __pa(sysinfo);
>         cmr_array_pa = __pa(cmr_array);
>         ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,

Either of them works for me, thanks!

> > SDM doesn't seem to be the place. That doesn't
> > even list TDCALL/SEAMCALL in Volume 2 :-( Let alone describe the magic
> > values.
> > 
> 
> TDX has it's own specs at here:
> 
> https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> 
> For this one you can find it in here:
> 
> https://cdrdv2.intel.com/v1/dl/getContent/733568

Yeah, eventually found it. I still think both TDCALL and SEAMCALL should
be listed in SDM Vol.2 instruction listing -- every valid instruction
should be found there IMO.

I also feel strongly that a global ABI should be decided upon for them
and the SDM would be a good place to mention that.  leaving this to
individual calls like now is a giant pain in the rear.

As is, we have TDCALL leaf-0 with a giant regset but every other leaf
has (c,d,8,9) for input and +(10,11) for output. Let's fix that in stone.

Obviously I also very strongly feel any such ABI must not conflict with
pre-existing calling conventions -- IOW, using BP is out, must not
happen.
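
Written down, the fixed ABI I'm asking for is no more than (sketch,
reusing the struct tdx_regs idea from the other subthread):

	/*
	 * Fixed TDCALL/SEAMCALL ABI (TDH.VP.ENTER excepted):
	 * RAX carries the leaf in and the status code out; everything
	 * not listed here must be preserved.
	 */
	struct tdx_regs {
		u64 rcx;	/* in/out */
		u64 rdx;	/* in/out */
		u64 r8;		/* in/out */
		u64 r9;		/* in/out */
		u64 r10;	/* out */
		u64 r11;	/* out */
	};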

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-30  9:25       ` Peter Zijlstra
@ 2023-06-30  9:48         ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-30  9:48 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen, Dave, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, tglx, linux-kernel, linux-mm, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Fri, 2023-06-30 at 11:25 +0200, Peter Zijlstra wrote:
> On Thu, Jun 29, 2023 at 12:00:44AM +0000, Huang, Kai wrote:
> 
> > The spec says it doesn't have a latency requirement, so theoretically it could
> > be long.  SEAMCALL is a VMEXIT so it would at least cost thousands of cycles.
> 
> :-(
> 
> > If raw_spinlock isn't desired, I think I can introduce another function to do
> > this and let the caller to call it before calling tdx_cpu_enable().  E.g., we
> > can have below functions:
> > 
> > 1) tdx_global_init()	-> TDH_SYS_INIT
> > 2) tdx_cpu_init()	-> TDH_SYS_LP_INIT
> > 3) tdx_enable()		-> actual module initialization
> > 
> > How does this sound?
> 
> Ah, wait, I hadn't had enough wake-up juice, it's tdx_global_init() that
> did the raw_spinlock_t, but that isn't the IPI thing.
> 
> Then perhaps just use a mutex to serialize things?
> 

In the current code, yes, TDH_SYS_INIT is protected by a raw_spinlock_t,
because it is done in tdx_cpu_enable().  I thought this would make the caller
(KVM)'s life easier, as it doesn't have to call an additional
tdx_global_init().

If we put TDH_SYS_INIT into an additional tdx_global_init(), then we are
essentially asking the caller to guarantee it is called before any
tdx_cpu_enable() (or tdx_cpu_init(), for better naming).  But in that case we
don't need the raw_spinlock anymore, because it's the caller's responsibility
now.

Neither of them is protected by the TDX module initialization mutex; only
tdx_enable() is.  The caller (KVM) is supposed to call tdx_cpu_enable() on all
online cpus via IPI function calls before calling tdx_enable().

So if using a raw_spinlock_t around TDH_SYS_INIT is a concern, we can go with
the dedicated tdx_global_init() function option.

Hope I've explained this clearly.
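
I.e., on the caller's (KVM's) side the whole dance would look roughly
like this (sketch; vmxon_and_tdx_cpu_init() is a made-up KVM-side helper
that does VMXON plus tdx_cpu_enable() and accumulates the error):

	int ret = 0;

	cpus_read_lock();

	ret = tdx_global_init();		/* TDH.SYS.INIT, once */
	if (!ret)
		on_each_cpu(vmxon_and_tdx_cpu_init, &ret, 1);
	if (!ret)
		ret = tdx_enable();		/* module initialization */

	cpus_read_unlock();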






^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-30  9:26       ` Peter Zijlstra
@ 2023-06-30  9:55         ` Huang, Kai
  2023-06-30 18:30           ` Peter Zijlstra
  0 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-30  9:55 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen, Dave, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, tglx, linux-kernel, linux-mm, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Fri, 2023-06-30 at 11:26 +0200, Peter Zijlstra wrote:
> On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote:
> > On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> > > 
> > > I can't find a single caller of this.. why is this exported?
> > 
> > It's for the KVM TDX patches to use, which aren't in this series.
> > 
> > I'll remove the export.  KVM TDX series can export it.
> 
> Fair enough; where will the KVM TDX series call this? Earlier there was
> talk about doing it at kvm module load time -- but I objected (and still
> do object) to that.
> 
> What's the current plan?
> 

The direction is still to do it during module load (not decided by my series
anyway).  But this can be a separate discussion with the KVM maintainers
involved.

I understand your concern that you don't want the memory & cpu time wasted on
enabling TDX by default.  For that we can have a kernel command line option to
disable TDX once and for all (we can even make that the default).  It's just
not in this initial TDX support series, but I'll send one once this initial
support is done, as mentioned in the cover letter of the previous version
(sadly I removed this paragraph for the sake of making the cover letter
shorter):

"
Also, the patch to add the new kernel command line tdx="force" isn't included
in this initial version, as Dave suggested it isn't mandatory.  But I
will add one once this initial version gets merged.
"

Also, KVM will have a module parameter 'enable_tdx'.  I am hoping this could
reduce your concern too.
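
Something like the below (sketch; the exact name and permissions are of
course up to the KVM series):

	/* Gate all of TDX behind an opt-in KVM module parameter. */
	static bool enable_tdx;
	module_param(enable_tdx, bool, 0444);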


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory
  2023-06-30  9:34       ` Peter Zijlstra
@ 2023-06-30  9:58         ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-30  9:58 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen, Dave, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, tglx, linux-kernel, linux-mm, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Fri, 2023-06-30 at 11:34 +0200, Peter Zijlstra wrote:
> On Thu, Jun 29, 2023 at 09:15:39AM +0000, Huang, Kai wrote:
> 
> > Sure.  How about below?
> > 
> > +       /*
> > +        * TDH.SYS.INFO writes the TDSYSINFO_STRUCT and the CMR array
> > +        * to the buffers provided by the kernel (via RCX and R8
> > +        * respectively).  The buffer size of the TDSYSINFO_STRUCT
> > +        * (via RDX) and the maximum entries of the CMR array (via R9)
> > +        * passed to this SEAMCALL must be at least the size of
> > +        * TDSYSINFO_STRUCT and MAX_CMRS respectively.
> > +        *
> > +        * Upon a successful return, R9 contains the actual entries
> > +        * written to the CMR array.
> > +        */
> >         sysinfo_pa = __pa(sysinfo);
> >         cmr_array_pa = __pa(cmr_array);
> >         ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
> 
> > Or should I just repeat the spec like below?
> 
> > +       /*
> > +        * TDH.SYS.INFO writes the TDSYSINFO_STRUCT and the CMR array
> > +        * to the buffers provided by the kernel:
> > +        *
> > +        * Input:
> > +        *  - RCX: The buffer of TDSYSINFO_STRUCT
> > +        *  - RDX: The size of the TDSYSINFO_STRUCT buffer, must be at
> > +        *         at least the size of TDSYSINFO_STRUCT
> > +        *  - R8: The buffer of the CMR array
> > +        *  - R9: The entry number of the array, must be at least
> > +        *        MAX_CMRS.
> > +        *
> > +        * Output (successful):
> > +        *  - RDX: The actual bytes written to the TDSYSINFO_STRUCT
> > +        *         buffer
> > +        *  - R9: The actual entries written to the CMR array.
> > +        */
> >         sysinfo_pa = __pa(sysinfo);
> >         cmr_array_pa = __pa(cmr_array);
> >         ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE,
> 
> Either of them works for me, thanks!

I will choose the first one since it's shorter.  Thanks!

> 
> > > SDM doesn't seem to be the place. That doesn't
> > > even list TDCALL/SEAMCALL in Volume 2 :-( Let alone describe the magic
> > > values.
> > > 
> > 
> > TDX has it's own specs at here:
> > 
> > https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> > 
> > For this one you can find it in here:
> > 
> > https://cdrdv2.intel.com/v1/dl/getContent/733568
> 
> Yeah, eventually found it. I still think both TDCALL and SEAMCALL should
> be listed in SDM Vol.2 instruction listing -- every valid instruction
> should be found there IMO.
> 
> I also feel strongly that a global ABI should be decided upon for them
> and the SDM would be a good place to mention that.  leaving this to
> individual calls like now is a giant pain in the rear.

Yeah, I agree the way the specs are organized is not ideal.  We have been
feeling the pain during our development too.

> 
> As is, we have TDCALL leaf-0 with a giant regset but every other leaf
> has (c,d,8,9) for input and +(10,11) for output. Let's fix that in stone.
> 
> Obviously I also very strongly feel any such ABI must not conflict with
> pre-existing calling conventions -- IOW, using BP is out, must not
> happen.

Fully agreed.


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-30  9:03           ` kirill.shutemov
@ 2023-06-30 10:02             ` Huang, Kai
  2023-06-30 10:22               ` kirill.shutemov
  0 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-30 10:02 UTC (permalink / raw)
  To: kirill.shutemov, peterz
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao,
	Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, x86,
	Williams, Dan J


> 
> I'm okay either way.
> 
> Obviously, arch/x86/coco/tdx/tdcall.S has to be patched to use the new
> TDX_MODULE_CALL macro.
> 

Cool, then we have consensus.

Kirill, will you do the patch(es), or do you want me to?

Unless Peter is already having this on his hand :)

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-29 10:33       ` Huang, Kai
@ 2023-06-30 10:06         ` Peter Zijlstra
  2023-06-30 10:18           ` Huang, Kai
  2023-06-30 10:21           ` Peter Zijlstra
  0 siblings, 2 replies; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-30 10:06 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Thu, Jun 29, 2023 at 10:33:38AM +0000, Huang, Kai wrote:
> On Wed, 2023-06-28 at 22:38 +0200, Peter Zijlstra wrote:
> > On Wed, Jun 28, 2023 at 05:29:01PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 27, 2023 at 02:12:50AM +1200, Kai Huang wrote:
> > > > diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> > > > index 49a54356ae99..757b0c34be10 100644
> > > > --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> > > > +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> > > > @@ -1,6 +1,7 @@
> > > >  /* SPDX-License-Identifier: GPL-2.0 */
> > > >  #include <asm/asm-offsets.h>
> > > >  #include <asm/tdx.h>
> > > > +#include <asm/asm.h>
> > > >  
> > > >  /*
> > > >   * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
> > > > @@ -45,6 +46,7 @@
> > > >  	/* Leave input param 2 in RDX */
> > > >  
> > > >  	.if \host
> > > > +1:
> > > >  	seamcall
> > > 
> > > So what registers are actually clobbered by SEAMCALL ? There's a
> > > distinct lack of it in SDM Vol.2 instruction list :-(
> > 
> > With the exception of the abomination that is TDH.VP.ENTER all SEAMCALLs
> > seem to be limited to the set presented here (c,d,8,9,10,11) and all
> > other registers should be available.
> 
> RAX is also used as SEAMCALL return code.
> 
> Looking at the later versions of TDX spec (with TD live migration, etc), it
> seems they are already using R12-R13 as SEAMCALL output:
> 
> https://cdrdv2.intel.com/v1/dl/getContent/733579

Urgh.. I think I read an older version because I got bleeding eyes from
all this colour coded crap.

All this red is unreadable :-( Have they been told about the glories of
TeX and diff ?

> E.g., 6.3.15. NEW: TDH.IMPORT.MEM Leaf
> 
> It uses R12 and R13 as input.

12 and 14. They skipped 13 for some mysterious raisin.

But also, 10,11 are frequently used as input with this new stuff, which
already suggests the setup from your patches is not tenable.

> > Can we please make that a hard requirement, SEAMCALL must not use
> > registers outside this? We can hardly program to random future
> > extensions; we need hard ABI guarantees here.
> 
> 
> I believe all other GPRs are just saved/restored in SEAMCALL/SEAMRET, so in
> practice all other GPRs not used as input/output should not be clobbered.  But I
> will confirm with the TDX module guys.  And even if it's true in practice, it's better
> to document it.  
> 
> But I think we also want to ask them to stop adding more registers as
> input/output.
> 
> I'll talk to TDX module team on this.

Please, because 12,14 are callee-saved, which means we need to go add
push/pop to preserve them :-(

Then you end up with something like this...

/*
 * TDX_MODULE_CALL - common helper macro for both
 *                 TDCALL and SEAMCALL instructions.
 *
 * TDCALL   - used by TDX guests to make requests to the
 *            TDX module and hypercalls to the VMM.
 * SEAMCALL - used by TDX hosts to make requests to the
 *            TDX module.
 *
 *-------------------------------------------------------------------------
 * TDCALL/SEAMCALL ABI:
 *-------------------------------------------------------------------------
 * Input Registers:
 *
 * RAX                 - TDCALL Leaf number.
 * RCX,RDX,R8-R11      - TDCALL Leaf specific input registers.
 *
 * Output Registers:
 *
 * RAX                 - TDCALL instruction error code.
 * RCX,RDX,R8-R11      - TDCALL Leaf specific output registers.
 * R12-R14	       - extra output registers
 *
 *-------------------------------------------------------------------------
 *
 * __tdx_module_call() function ABI:
 *
 * @fn   (RDI)         - TDCALL Leaf ID,    moved to RAX
 * @regs (RSI)         - struct tdx_regs pointer
 *
 * Return status of TDCALL via RAX.
 */
.macro TDX_MODULE_CALL host:req ret:req extra=0
	FRAME_BEGIN

	movq	%rdi, %rax
	movq	$TDX_SEAMCALL_VMFAILINVALID, %rdi

	movq	TDX_MODULE_rcx(%rsi), %rcx
	movq	TDX_MODULE_rdx(%rsi), %rdx
	movq	TDX_MODULE_r8(%rsi),  %r8
	movq	TDX_MODULE_r9(%rsi),  %r9
	movq	TDX_MODULE_r10(%rsi), %r10
	movq	TDX_MODULE_r11(%rsi), %r11
.if \extra
	pushq	%r12
	pushq	%r13
	pushq	%r14

//	movq	TDX_MODULE_r12(%rsi), %r12
//	movq	TDX_MODULE_r13(%rsi), %r13
//	movq	TDX_MODULE_r14(%rsi), %r14
.endif

.if \host
1:	seamcall
	/*
	 * SEAMCALL instruction is essentially a VMExit from VMX root
	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
	 * that the targeted SEAM firmware is not loaded or disabled,
	 * or P-SEAMLDR is busy with another SEAMCALL.  %rax is not
	 * changed in this case.
	 *
	 * Set %rax to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
	 * This value will never be used as actual SEAMCALL error code as
	 * it is from the Reserved status code class.
	 */
	cmovc	%rdi, %rax
2:
.else
	tdcall
.endif

.if \ret
	movq	%rcx, TDX_MODULE_rcx(%rsi)
	movq	%rdx, TDX_MODULE_rdx(%rsi)
	movq	%r8,  TDX_MODULE_r8(%rsi)
	movq	%r9,  TDX_MODULE_r9(%rsi)
	movq	%r10, TDX_MODULE_r10(%rsi)
	movq	%r11, TDX_MODULE_r11(%rsi)
.endif
.if \extra
	movq	%r12, TDX_MODULE_r12(%rsi)
	movq	%r13, TDX_MODULE_r13(%rsi)
	movq	%r14, TDX_MODULE_r14(%rsi)

	popq	%r14
	popq	%r13
	popq	%r12
.endif

	FRAME_END
	RET

.if \host
3:
	mov	$TDX_SW_ERROR, %rdi
	or	%rdi, %rax
	jmp 2b

	_ASM_EXTABLE_FAULT(1b, 3b)
.endif
.endm

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-30  9:22                 ` Peter Zijlstra
@ 2023-06-30 10:09                   ` Huang, Kai
  2023-06-30 18:42                     ` Isaku Yamahata
  2023-07-01  8:15                     ` Huang, Kai
  0 siblings, 2 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-30 10:09 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Gao, Chao, Raj, Ashok, Shahar, Sagi, Hansen, Dave, david,
	bagasdotme, ak, Wysocki, Rafael J, linux-kernel, Chatre,
	Reinette, Christopherson,,
	Sean, pbonzini, mingo, Yamahata, Isaku, nik.borisov, tglx, Luck,
	Tony, kirill.shutemov, hpa, imammedo, linux-mm, bp, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, x86, Williams, Dan J

On Fri, 2023-06-30 at 11:22 +0200, Peter Zijlstra wrote:
> On Thu, Jun 29, 2023 at 12:15:13AM +0000, Huang, Kai wrote:
> 
> > > 	Can be called locally or through an IPI function call.
> > > 
> > 
> > Thanks.  As in another reply, if using spinlock is OK, then I think we can say
> > it will be called either locally or through an IPI function call.  Otherwise, we
> > do it via a new separate function tdx_global_init(), and no lock is needed in that
> > function.  The caller should call it properly.
> 
> IPI must use raw_spinlock_t. I'm ok with using raw_spinlock_t if there's
> actual need for that, but the code as presented didn't -- in comments or
> otherwise -- make it clear why it was as it was.

There's no hard requirement as I replied in another email.

Presumably you prefer the option to have a dedicated tdx_global_init() so we can
avoid the raw_spinlock_t?

Thanks.

> 
> TDX not specifying time constraints on the various TD/SEAM-CALLs is
> of course sad, but alas.

Agreed.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-30 10:06         ` Peter Zijlstra
@ 2023-06-30 10:18           ` Huang, Kai
  2023-06-30 15:16             ` Dave Hansen
  2023-06-30 10:21           ` Peter Zijlstra
  1 sibling, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-06-30 10:18 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen, Dave, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, tglx, linux-kernel, linux-mm, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Fri, 2023-06-30 at 12:06 +0200, Peter Zijlstra wrote:
> On Thu, Jun 29, 2023 at 10:33:38AM +0000, Huang, Kai wrote:
> > On Wed, 2023-06-28 at 22:38 +0200, Peter Zijlstra wrote:
> > > On Wed, Jun 28, 2023 at 05:29:01PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jun 27, 2023 at 02:12:50AM +1200, Kai Huang wrote:
> > > > > diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
> > > > > index 49a54356ae99..757b0c34be10 100644
> > > > > --- a/arch/x86/virt/vmx/tdx/tdxcall.S
> > > > > +++ b/arch/x86/virt/vmx/tdx/tdxcall.S
> > > > > @@ -1,6 +1,7 @@
> > > > >  /* SPDX-License-Identifier: GPL-2.0 */
> > > > >  #include <asm/asm-offsets.h>
> > > > >  #include <asm/tdx.h>
> > > > > +#include <asm/asm.h>
> > > > >  
> > > > >  /*
> > > > >   * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
> > > > > @@ -45,6 +46,7 @@
> > > > >  	/* Leave input param 2 in RDX */
> > > > >  
> > > > >  	.if \host
> > > > > +1:
> > > > >  	seamcall
> > > > 
> > > > So what registers are actually clobbered by SEAMCALL ? There's a
> > > > distinct lack of it in SDM Vol.2 instruction list :-(
> > > 
> > > With the exception of the abomination that is TDH.VP.ENTER all SEAMCALLs
> > > seem to be limited to the set presented here (c,d,8,9,10,11) and all
> > > other registers should be available.
> > 
> > RAX is also used as SEAMCALL return code.
> > 
> > Looking at the later versions of TDX spec (with TD live migration, etc), it
> > seems they are already using R12-R13 as SEAMCALL output:
> > 
> > https://cdrdv2.intel.com/v1/dl/getContent/733579
> 
> Urgh.. I think I read an older version because I got bleeding eyes from
> all this colour coded crap.
> 
> All this red is unreadable :-( Have they been told about the glories of
> TeX and diff ?
> 
> > E.g., 6.3.15. NEW: TDH.IMPORT.MEM Leaf
> > 
> > It uses R12 and R13 as input.
> 
> 12 and 14. They skipped 13 for some mysterious raisin.
> 
> But also, 10,11 are frequently used as input with this new stuff, which
> already suggests the setup from your patches is not tenable.
> 
> > > Can we please make that a hard requirement, SEAMCALL must not use
> > > registers outside this? We can hardly program to random future
> > > extensions; we need hard ABI guarantees here.
> > 
> > 
> > I believe all other GPRs are just saved/restored in SEAMCALL/SEAMRET, so in
> > practice all other GPRs not used as input/output should not be clobbered.  But I
> > will confirm with the TDX module guys.  And even if it's true in practice, it's better
> > to document it.  
> > 
> > But I think we also want to ask them to stop adding more registers as
> > input/output.
> > 
> > I'll talk to TDX module team on this.
> 
> Please, because 12,14 are callee-saved, which means we need to go add
> push/pop to preserve them :-(

Yes.

However those new SEAMCALLs are for TDX guest live migration support, which is
a year(s)-away thing from upstreaming's point of view.  My thinking is we can
defer supporting those new SEAMCALLs until that phase.  Yes, we would need to
make some assembly changes at that time, but that also looks fine to me.

How does this sound?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-30 10:06         ` Peter Zijlstra
  2023-06-30 10:18           ` Huang, Kai
@ 2023-06-30 10:21           ` Peter Zijlstra
  2023-06-30 11:05             ` Huang, Kai
  2023-06-30 12:06             ` Peter Zijlstra
  1 sibling, 2 replies; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-30 10:21 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Fri, Jun 30, 2023 at 12:07:00PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 29, 2023 at 10:33:38AM +0000, Huang, Kai wrote:

> > Looking at the later versions of TDX spec (with TD live migration, etc), it
> > seems they are already using R12-R13 as SEAMCALL output:
> > 
> > https://cdrdv2.intel.com/v1/dl/getContent/733579
> 
> Urgh.. I think I read an older versio because I got bleeding eyes from
> all this colour coded crap.
> 
> All this red is unreadable :-( Have they been told about the glories of
> TeX and diff ?
> 
> > E.g., 6.3.15. NEW: TDH.IMPORT.MEM Leaf
> > 
> > It uses R12 and R13 as input.
> 
> 12 and 14. They skipped 13 for some mysterious raisin.

Things like TDH.SERVTD.BIND do use R13.

> But also, 10,11 are frequently used as input with this new stuff, which
> already suggests the setup from your patches is not tenable.


TDG.SERVTD.RD *why* can't they pass that TD_UUID as a pointer? Using *4*
registers like that is quite insane.

TDG.VP.ENTER :-(((( that has b,15,si,di as additional output.

That means there's not a single register left unused. Can we still get
this changed, please?!?



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-30 10:02             ` Huang, Kai
@ 2023-06-30 10:22               ` kirill.shutemov
  2023-06-30 11:06                 ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: kirill.shutemov @ 2023-06-30 10:22 UTC (permalink / raw)
  To: Huang, Kai
  Cc: peterz, kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck,
	Tony, ak, Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, Yamahata, Isaku, Chatre,
	Reinette, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao,
	Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, x86,
	Williams, Dan J

On Fri, Jun 30, 2023 at 10:02:32AM +0000, Huang, Kai wrote:
> 
> > 
> > I'm okay either way.
> > 
> > Obviously, arch/x86/coco/tdx/tdcall.S has to be patched to use the new
> > TDX_MODULE_CALL macro.
> > 
> 
> Cool, then we have consensus.
> 
> Kirill, will you do the patch(es), or do you want me to?

Please, do.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-30 10:21           ` Peter Zijlstra
@ 2023-06-30 11:05             ` Huang, Kai
  2023-06-30 12:06             ` Peter Zijlstra
  1 sibling, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-30 11:05 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen, Dave, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, tglx, linux-kernel, linux-mm, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Fri, 2023-06-30 at 12:21 +0200, Peter Zijlstra wrote:
> On Fri, Jun 30, 2023 at 12:07:00PM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 29, 2023 at 10:33:38AM +0000, Huang, Kai wrote:
> 
> > > Looking at the later versions of TDX spec (with TD live migration, etc), it
> > > seems they are already using R12-R13 as SEAMCALL output:
> > > 
> > > https://cdrdv2.intel.com/v1/dl/getContent/733579
> > 
> > Urgh.. I think I read an older version because I got bleeding eyes from
> > all this colour coded crap.
> > 
> > All this red is unreadable :-( Have they been told about the glories of
> > TeX and diff ?
> > 
> > > E.g., 6.3.15. NEW: TDH.IMPORT.MEM Leaf
> > > 
> > > It uses R12 and R13 as input.
> > 
> > 12 and 14. They skipped 13 for some mysterious raisin.
> 
> Things like TDH.SERVTD.BIND do use R13.
> 
> > But also, 10,11 are frequently used as input with this new stuff, which
> > already suggests the setup from your patches is not tenable.
> 
> 
> TDG.SERVTD.RD *why* can't they pass that TD_UUID as a pointer? Using *4*
> registers like that is quite insane.

I can ask the TDX module team whether they can change it to use a buffer
instead of 4 registers.

> 
> TDG.VP.ENTER :-(((( that has b,15,si,di as additional output.
> 
> That means there's not a single register left unused. Can we still get
> this changed, please?!?
> 

I assume it's TDH.VP.ENTER.

TDH.VP.ENTER is a special SEAMCALL because it enters the TDX guest to run it.
KVM should be the only place that handles it.  And KVM won't use the
__seamcall() here to handle it but will have its own assembly.

For normal VMX guests, since most GPRs are not saved/restored by hardware, KVM
already has sufficient code to save/restore relevant GPRs during VMENTER/VMEXIT.
Handling TDX can just follow that pattern.

So this looks fine to me.



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-30 10:22               ` kirill.shutemov
@ 2023-06-30 11:06                 ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-06-30 11:06 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: kvm, Williams, Dan J, Raj, Ashok, Hansen, Dave, david,
	bagasdotme, ak, Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, Yamahata, Isaku, nik.borisov,
	tglx, Luck, Tony, hpa, peterz, Shahar, Sagi, imammedo, bp, Gao,
	Chao, Chatre, Reinette, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, x86

On Fri, 2023-06-30 at 13:22 +0300, kirill.shutemov@linux.intel.com wrote:
> On Fri, Jun 30, 2023 at 10:02:32AM +0000, Huang, Kai wrote:
> > 
> > > 
> > > I'm okay either way.
> > > 
> > > Obviously, arch/x86/coco/tdx/tdcall.S has to be patched to use the new
> > > TDX_MODULE_CALL macro.
> > > 
> > 
> > Cool, then we have consensus.
> > 
> > Kirill, will you do the patch(es), or do you want me to?
> 
> Please, do.
> 

OK, I'll do it.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-30 10:21           ` Peter Zijlstra
  2023-06-30 11:05             ` Huang, Kai
@ 2023-06-30 12:06             ` Peter Zijlstra
  2023-06-30 15:14               ` Peter Zijlstra
  2023-07-03 12:15               ` Huang, Kai
  1 sibling, 2 replies; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-30 12:06 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Fri, Jun 30, 2023 at 12:21:41PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 30, 2023 at 12:07:00PM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 29, 2023 at 10:33:38AM +0000, Huang, Kai wrote:
> 
> > > Looking at the later versions of TDX spec (with TD live migration, etc), it
> > > seems they are already using R12-R13 as SEAMCALL output:
> > > 
> > > https://cdrdv2.intel.com/v1/dl/getContent/733579
> > 
> > Urgh.. I think I read an older version because I got bleeding eyes from
> > all this colour coded crap.
> > 
> > All this red is unreadable :-( Have they been told about the glories of
> > TeX and diff ?
> > 
> > > E.g., 6.3.15. NEW: TDH.IMPORT.MEM Leaf
> > > 
> > > It uses R12 and R13 as input.
> > 
> > 12 and 14. They skipped 13 for some mysterious raisin.
> 
> Things like TDH.SERVTD.BIND do use R13.
> 
> > But also, 10,11 are frequently used as input with this new stuff, which
> > already suggests the setup from your patches is not tenable.
> 
> 
> TDG.SERVTD.RD *why* can't they pass that TD_UUID as a pointer? Using *4*
> registers like that is quite insane.
> 
> TDG.VP.ENTER :-(((( that has b,15,si,di as additional output.
> 
> That means there's not a single register left unused. Can we still get
> this changed, please?!?

Can't :/, VP.ENTER mirrors VP.VMCALL, so we need to deal with both.

So I think the below deals with everything and unifies __tdx_hypercall()
and __tdx_module_call(), since both sides need to deal with exactly the
same trainwreck.


/*
 * Used for input/output registers values of the TDCALL and SEAMCALL
 * instructions when requesting services from the TDX module.
 *
 * This is a software only structure and not part of the TDX module/VMM ABI.
 */
struct tdx_module_args {
	/* callee-clobbered */
	u64 rdx;
	u64 rcx;
	u64 r8;
	u64 r9;
	/* extra callee-clobbered */
	u64 r10;
	u64 r11;
	/* callee-saved + rdi/rsi */
	u64 rdi;
	u64 rsi;
	u64 rbx;
	u64 r12;
	u64 r13;
	u64 r14;
	u64 r15;
};
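
(The TDX_MODULE_* offsets used by the macro below would be generated the usual
asm-offsets way; a minimal sketch, assuming struct tdx_module_args is visible
to asm-offsets.c -- illustrative only, not part of this proposal:)

/* arch/x86/kernel/asm-offsets.c (sketch) */
#include <linux/kbuild.h>

static void __used tdx_module_offsets(void)
{
	OFFSET(TDX_MODULE_rcx, tdx_module_args, rcx);
	OFFSET(TDX_MODULE_rdx, tdx_module_args, rdx);
	OFFSET(TDX_MODULE_r8,  tdx_module_args, r8);
	OFFSET(TDX_MODULE_r9,  tdx_module_args, r9);
	OFFSET(TDX_MODULE_r10, tdx_module_args, r10);
	OFFSET(TDX_MODULE_r11, tdx_module_args, r11);
	OFFSET(TDX_MODULE_rdi, tdx_module_args, rdi);
	OFFSET(TDX_MODULE_rsi, tdx_module_args, rsi);
	OFFSET(TDX_MODULE_rbx, tdx_module_args, rbx);
	OFFSET(TDX_MODULE_r12, tdx_module_args, r12);
	OFFSET(TDX_MODULE_r13, tdx_module_args, r13);
	OFFSET(TDX_MODULE_r14, tdx_module_args, r14);
	OFFSET(TDX_MODULE_r15, tdx_module_args, r15);
}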



/*
 * TDX_MODULE_CALL - common helper macro for both
 *                   TDCALL and SEAMCALL instructions.
 *
 * TDCALL   - used by TDX guests to make requests to the
 *            TDX module and hypercalls to the VMM.
 *
 * SEAMCALL - used by TDX hosts to make requests to the
 *            TDX module.
 *
 *-------------------------------------------------------------------------
 * TDCALL/SEAMCALL ABI:
 *-------------------------------------------------------------------------
 * Input Registers:
 *
 * RAX                 - Leaf number.
 * RCX,RDX,R8-R11      - Leaf specific input registers.
 * RDI,RSI,RBX,R12-R15 - VP.VMCALL VP.ENTER
 *
 * Output Registers:
 *
 * RAX                 - instruction error code.
 * RCX,RDX,R8-R11      - Leaf specific output registers.
 * RDI,RSI,RBX,R12-R15 - VP.VMCALL VP.ENTER
 *
 *-------------------------------------------------------------------------
 *
 * So while the common core (RAX,RCX,RDX,R8-R11) fits nicely in the
 * callee-clobbered registers and even leaves RDI,RSI free to act as a base
 * pointer some rare leafs (VP.VMCALL, VP.ENTER) make a giant mess of things.
 *
 * For simplicity, assume that anything that needs the callee-saved regs also
 * tramples on RDI,RSI. This isn't strictly true, see for example EXPORT.MEM.
 */
.macro TDX_MODULE_CALL host:req ret:req saved:0
	FRAME_BEGIN

	movq	%rdi, %rax

	movq	TDX_MODULE_rcx(%rsi), %rcx
	movq	TDX_MODULE_rdx(%rsi), %rdx
	movq	TDX_MODULE_r8(%rsi),  %r8
	movq	TDX_MODULE_r9(%rsi),  %r9
	movq	TDX_MODULE_r10(%rsi), %r10
	movq	TDX_MODULE_r11(%rsi), %r11

.if \saved
	pushq	%rbx
	pushq	%r12
	pushq	%r13
	pushq	%r14
	pushq	%r15

	movq	TDX_MODULE_rbx(%rsi), %rbx
	movq	TDX_MODULE_r12(%rsi), %r12
	movq	TDX_MODULE_r13(%rsi), %r13
	movq	TDX_MODULE_r14(%rsi), %r14
	movq	TDX_MODULE_r15(%rsi), %r15

	/* VP.VMCALL and VP.ENTER */
.if \ret
	pushq	%rsi
.endif
	movq	TDX_MODULE_rdi(%rsi), %rdi
	movq	TDX_MODULE_rsi(%rsi), %rsi
.endif

.Lcall:
.if \host
	seamcall
	/*
	 * SEAMCALL instruction is essentially a VMExit from VMX root
	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
	 * that the targeted SEAM firmware is not loaded or disabled,
	 * or P-SEAMLDR is busy with another SEAMCALL. RAX is not
	 * changed in this case.
	 */
	jc	.Lseamfail

.if \saved && \ret
	/*
	 * VP.ENTER clears RSI on output, use it to restore state.
	 */
	popq	%rsi
	xor	%edi,%edi
	movq	%rdi, TDX_MODULE_rdi(%rsi)
	movq	%rdi, TDX_MODULE_rsi(%rsi)
.endif
.else
	tdcall

	/*
	 * RAX!=0 indicates a failure, assume no return values.
	 */
	testq	%rax, %rax
	jne	.Lerror

.if \saved && \ret
	/*
	 * Since RAX==0, it can be used as a scratch register to restore state.
	 *
	 * [ assumes \saved implies \ret ]
	 */
	popq	%rax
	movq	%rdi, TDX_MODULE_rdi(%rax)
	movq	%rsi, TDX_MODULE_rsi(%rax)
	movq	%rax, %rsi
	xor	%eax, %eax;
.endif
.endif // \host

.if \ret
	/* RSI is restored */
	movq	%rcx, TDX_MODULE_rcx(%rsi)
	movq	%rdx, TDX_MODULE_rdx(%rsi)
	movq	%r8,  TDX_MODULE_r8(%rsi)
	movq	%r9,  TDX_MODULE_r9(%rsi)
	movq	%r10, TDX_MODULE_r10(%rsi)
	movq	%r11, TDX_MODULE_r11(%rsi)
.if \saved
	movq	%rbx, TDX_MODULE_rbx(%rsi)
	movq	%r12, TDX_MODULE_r12(%rsi)
	movq	%r13, TDX_MODULE_r13(%rsi)
	movq	%r14, TDX_MODULE_r14(%rsi)
	movq	%r15, TDX_MODULE_r15(%rsi)
.endif
.endif // \ret

.Lout:
.if \saved
	popq	%r15
	popq	%r14
	popq	%r13
	popq	%r12
	popq	%rbx
.endif
	FRAME_END
	RET

	/*
	 * Error and exception handling at .Lcall. Ignore \ret on failure.
	 */
.Lerror:
.if \saved && \ret
	popq	%rsi
.endif
	jmp	.Lout

.if \host
.Lseamfail:
	/*
	 * Set RAX to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
	 * This value will never be used as actual SEAMCALL error code as
	 * it is from the Reserved status code class.
	 */
	movq	$TDX_SEAMCALL_VMFAILINVALID, %rax
	jmp	.Lerror

.Lfault:
	/*
	 * SEAMCALL caused #GP or #UD. Per _ASM_EXTABLE_FAULT() RAX
	 * contains the trap number, convert to a TDX error code by
	 * setting the high word to TDX_SW_ERROR.
	 */
	mov	$TDX_SW_ERROR, %rdi
	or	%rdi, %rax
	jmp	.Lerror

	_ASM_EXTABLE_FAULT(.Lcall, .Lfault)
.endif
.endm

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-30 12:06             ` Peter Zijlstra
@ 2023-06-30 15:14               ` Peter Zijlstra
  2023-07-03 12:15               ` Huang, Kai
  1 sibling, 0 replies; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-30 15:14 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, Shahar, Sagi,
	imammedo, bp, Gao, Chao, Brown, Len, sathyanarayanan.kuppuswamy,
	Huang, Ying, Williams, Dan J, x86

On Fri, Jun 30, 2023 at 02:06:50PM +0200, Peter Zijlstra wrote:
> /*
>  * Used for input/output registers values of the TDCALL and SEAMCALL
>  * instructions when requesting services from the TDX module.
>  *
>  * This is a software only structure and not part of the TDX module/VMM ABI.
>  */
> struct tdx_module_args {
> 	/* callee-clobbered */
> 	u64 rdx;
> 	u64 rcx;
> 	u64 r8;
> 	u64 r9;
> 	/* extra callee-clobbered */
> 	u64 r10;
> 	u64 r11;
> 	/* callee-saved + rdi/rsi */
> 	u64 rdi;
> 	u64 rsi;
> 	u64 rbx;
> 	u64 r12;
> 	u64 r13;
> 	u64 r14;
> 	u64 r15;
> };
> 
> 
> 
> /*
>  * TDX_MODULE_CALL - common helper macro for both
>  *                   TDCALL and SEAMCALL instructions.
>  *
>  * TDCALL   - used by TDX guests to make requests to the
>  *            TDX module and hypercalls to the VMM.
>  *
>  * SEAMCALL - used by TDX hosts to make requests to the
>  *            TDX module.
>  *
>  *-------------------------------------------------------------------------
>  * TDCALL/SEAMCALL ABI:
>  *-------------------------------------------------------------------------
>  * Input Registers:
>  *
>  * RAX                 - Leaf number.
>  * RCX,RDX,R8-R11      - Leaf specific input registers.
>  * RDI,RSI,RBX,R12-R15 - VP.VMCALL VP.ENTER
>  *
>  * Output Registers:
>  *
>  * RAX                 - instruction error code.
>  * RCX,RDX,R8-R11      - Leaf specific output registers.
>  * RDI,RSI,RBX,R12-R15 - VP.VMCALL VP.ENTER
>  *
>  *-------------------------------------------------------------------------
>  *
>  * So while the common core (RAX,RCX,RDX,R8-R11) fits nicely in the
>  * callee-clobbered registers and even leaves RDI,RSI free to act as a base
>  * pointer some rare leafs (VP.VMCALL, VP.ENTER) make a giant mess of things.
>  *
>  * For simplicity, assume that anything that needs the callee-saved regs also
>  * tramples on RDI,RSI. This isn't strictly true, see for example EXPORT.MEM.
>  */
> .macro TDX_MODULE_CALL host:req ret:req saved:0
> 	FRAME_BEGIN
> 
> 	movq	%rdi, %rax
> 
> 	movq	TDX_MODULE_rcx(%rsi), %rcx
> 	movq	TDX_MODULE_rdx(%rsi), %rdx
> 	movq	TDX_MODULE_r8(%rsi),  %r8
> 	movq	TDX_MODULE_r9(%rsi),  %r9
> 	movq	TDX_MODULE_r10(%rsi), %r10
> 	movq	TDX_MODULE_r11(%rsi), %r11
> 
> .if \saved
> 	pushq	%rbx
> 	pushq	%r12
> 	pushq	%r13
> 	pushq	%r14
> 	pushq	%r15
> 
> 	movq	TDX_MODULE_rbx(%rsi), %rbx
> 	movq	TDX_MODULE_r12(%rsi), %r12
> 	movq	TDX_MODULE_r13(%rsi), %r13
> 	movq	TDX_MODULE_r14(%rsi), %r14
> 	movq	TDX_MODULE_r15(%rsi), %r15
> 
> 	/* VP.VMCALL and VP.ENTER */
> .if \ret
> 	pushq	%rsi
> .endif
> 	movq	TDX_MODULE_rdi(%rsi), %rdi
> 	movq	TDX_MODULE_rsi(%rsi), %rsi
> .endif
> 
> .Lcall:
> .if \host
> 	seamcall
> 	/*
> 	 * SEAMCALL instruction is essentially a VMExit from VMX root
> 	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
> 	 * that the targeted SEAM firmware is not loaded or disabled,
> 	 * or P-SEAMLDR is busy with another SEAMCALL. RAX is not
> 	 * changed in this case.
> 	 */
> 	jc	.Lseamfail
> 
> .if \saved && \ret
> 	/*
> 	 * VP.ENTER clears RSI on output, use it to restore state.
> 	 */
> 	popq	%rsi
> 	xor	%edi,%edi
> 	movq	%rdi, TDX_MODULE_rdi(%rsi)
> 	movq	%rdi, TDX_MODULE_rsi(%rsi)
> .endif
> .else
> 	tdcall
> 
> 	/*
> 	 * RAX!=0 indicates a failure, assume no return values.
> 	 */
> 	testq	%rax, %rax
> 	jne	.Lerror
> 
> .if \saved && \ret
> 	/*
> 	 * Since RAX==0, it can be used as a scratch register to restore state.
> 	 *
> 	 * [ assumes \saved implies \ret ]

This comment is wrong. As should be obvious from the condition above.

> 	 */
> 	popq	%rax
> 	movq	%rdi, TDX_MODULE_rdi(%rax)
> 	movq	%rsi, TDX_MODULE_rsi(%rax)
> 	movq	%rax, %rsi
> 	xor	%eax, %eax;
> .endif
> .endif // \host
> 
> .if \ret
> 	/* RSI is restored */
> 	movq	%rcx, TDX_MODULE_rcx(%rsi)
> 	movq	%rdx, TDX_MODULE_rdx(%rsi)
> 	movq	%r8,  TDX_MODULE_r8(%rsi)
> 	movq	%r9,  TDX_MODULE_r9(%rsi)
> 	movq	%r10, TDX_MODULE_r10(%rsi)
> 	movq	%r11, TDX_MODULE_r11(%rsi)
> .if \saved
> 	movq	%rbx, TDX_MODULE_rbx(%rsi)
> 	movq	%r12, TDX_MODULE_r12(%rsi)
> 	movq	%r13, TDX_MODULE_r13(%rsi)
> 	movq	%r14, TDX_MODULE_r14(%rsi)
> 	movq	%r15, TDX_MODULE_r15(%rsi)
> .endif
> .endif // \ret
> 
> .Lout:
> .if \saved
> 	popq	%r15
> 	popq	%r14
> 	popq	%r13
> 	popq	%r12
> 	popq	%rbx
> .endif
> 	FRAME_END
> 	RET
> 
> 	/*
> 	 * Error and exception handling at .Lcall. Ignore \ret on failure.
> 	 */
> .Lerror:
> .if \saved && \ret
> 	popq	%rsi
> .endif
> 	jmp	.Lout
> 
> .if \host
> .Lseamfail:
> 	/*
> 	 * Set RAX to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
> 	 * This value will never be used as actual SEAMCALL error code as
> 	 * it is from the Reserved status code class.
> 	 */
> 	movq	$TDX_SEAMCALL_VMFAILINVALID, %rax
> 	jmp	.Lerror
> 
> .Lfault:
> 	/*
> 	 * SEAMCALL caused #GP or #UD. Per _ASM_EXTABLE_FAULT() RAX
> 	 * contains the trap number, convert to a TDX error code by
> 	 * setting the high word to TDX_SW_ERROR.
> 	 */
> 	mov	$TDX_SW_ERROR, %rdi
> 	or	%rdi, %rax
> 	jmp	.Lerror
> 
> 	_ASM_EXTABLE_FAULT(.Lcall, .Lfault)
> .endif
> .endm

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-30 10:18           ` Huang, Kai
@ 2023-06-30 15:16             ` Dave Hansen
  2023-07-01  8:16               ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: Dave Hansen @ 2023-06-30 15:16 UTC (permalink / raw)
  To: Huang, Kai, peterz
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, ak, Wysocki,
	Rafael J, kirill.shutemov, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, mingo, tglx, linux-kernel, linux-mm, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On 6/30/23 03:18, Huang, Kai wrote:
>> Please, because 12,14 are callee-saved, which means we need to go add
>> push/pop to preserve them :-(
> Yes.
> 
> However those new SEAMCALLs are for TDX guest live migration support, which is
> a year (or more) away from upstreaming's point of view.  My thinking is we can
> defer supporting those new SEAMCALLs until that phase.  Yes, we will need some
> assembly changes at that time, but that also looks fine to me.
> 
> How does this sound?

It would sound better if the TDX module folks would take that year to
fix the module and make it nicer for Linux. :)

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-30  9:55         ` Huang, Kai
@ 2023-06-30 18:30           ` Peter Zijlstra
  2023-06-30 19:05             ` Isaku Yamahata
  0 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-06-30 18:30 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen, Dave, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, tglx, linux-kernel, linux-mm, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Fri, Jun 30, 2023 at 09:55:32AM +0000, Huang, Kai wrote:
> On Fri, 2023-06-30 at 11:26 +0200, Peter Zijlstra wrote:
> > On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote:
> > > On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote:
> > > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> > > > 
> > > > I can't find a single caller of this.. why is this exported?
> > > 
> > > It's for the KVM TDX patches to use, which aren't in this series.
> > > 
> > > I'll remove the export.  The KVM TDX series can export it.
> > 
> > Fair enough; where will the KVM TDX series call this? Earlier there was
> > talk about doing it at kvm module load time -- but I objected (and still
> > do object) to that.
> > 
> > What's the current plan?
> > 
> 
> The direction is still doing it during module load (not in this series anyway).
> But this can be a separate discussion with the KVM maintainers involved.

They are all on Cc afaict.

> I understand your concern that you don't want to have memory & cpu time
> wasted on enabling TDX by default.  For that we can have a kernel command line
> option to disable TDX once and for all (we can even make that the default).

That's insane, I don't want to totally disable it. I want it done at
guest creation. Do the whole TDX setup the moment you actually create a
TDX guest.

Totally killing TDX is stupid, just about as stupid as doing it on
module load (which equates to always doing it).

> Also, KVM will have a module parameter 'enable_tdx'.  I am hoping this could
> reduce your concern too.

I don't get this obsession with doing it at module load time :/

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-30 10:09                   ` Huang, Kai
@ 2023-06-30 18:42                     ` Isaku Yamahata
  2023-07-01  8:15                     ` Huang, Kai
  1 sibling, 0 replies; 159+ messages in thread
From: Isaku Yamahata @ 2023-06-30 18:42 UTC (permalink / raw)
  To: Huang, Kai
  Cc: peterz, kvm, Gao, Chao, Raj, Ashok, Shahar, Sagi, Hansen, Dave,
	david, bagasdotme, ak, Wysocki, Rafael J, linux-kernel, Chatre,
	Reinette, Christopherson,,
	Sean, pbonzini, mingo, Yamahata, Isaku, nik.borisov, tglx, Luck,
	Tony, kirill.shutemov, hpa, imammedo, linux-mm, bp, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, x86, Williams, Dan J,
	isaku.yamahata

On Fri, Jun 30, 2023 at 10:09:08AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Fri, 2023-06-30 at 11:22 +0200, Peter Zijlstra wrote:
> > On Thu, Jun 29, 2023 at 12:15:13AM +0000, Huang, Kai wrote:
> > 
> > > > 	Can be called locally or through an IPI function call.
> > > > 
> > > 
> > > Thanks.  As in another reply, if using spinlock is OK, then I think we can say
> > > it will be called either locally or through an IPI function call.  Otherwise, we
> > > do via a new separate function tdx_global_init() and no lock is needed in that
> > > function.  The caller should call it properly.
> > 
> > IPI must use raw_spinlock_t. I'm ok with using raw_spinlock_t if there's
> > actual need for that, but the code as presented didn't -- in comments or
> > otherwise -- make it clear why it was as it was.
> 
> There's no hard requirement as I replied in another email.
> 
> Presumably you prefer the option to have a dedicated tdx_global_init() so we can
> avoid the raw_spinlock_t?

TDX KVM calls tdx_cpu_enable() in IPI context from the KVM hardware_setup()
callback.  tdx_cpu_enable() calls tdx_global_init().
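
Concretely, something along these lines (just a sketch; tdx_enable_all_cpus()
and the error plumbing are made up for illustration, not the actual patches):

#include <linux/smp.h>
#include <linux/atomic.h>

static atomic_t tdx_cpu_err = ATOMIC_INIT(0);

static void tdx_cpu_enable_fn(void *unused)
{
	/* IPI context: IRQs are disabled, as tdx_cpu_enable() expects. */
	if (tdx_cpu_enable())
		atomic_set(&tdx_cpu_err, -EIO);
}

/* Called from KVM's hardware_setup() to do the per-cpu TDX init. */
static int tdx_enable_all_cpus(void)
{
	/* Run the callback on every online CPU and wait for completion. */
	on_each_cpu(tdx_cpu_enable_fn, NULL, 1);
	return atomic_read(&tdx_cpu_err);
}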
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-30 18:30           ` Peter Zijlstra
@ 2023-06-30 19:05             ` Isaku Yamahata
  2023-06-30 21:24               ` Sean Christopherson
  0 siblings, 1 reply; 159+ messages in thread
From: Isaku Yamahata @ 2023-06-30 19:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Huang, Kai, kvm, Raj, Ashok, Luck, Tony, david, bagasdotme,
	Hansen, Dave, ak, Wysocki, Rafael J, kirill.shutemov, Chatre,
	Reinette, Christopherson,,
	Sean, pbonzini, mingo, tglx, linux-kernel, linux-mm, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86, isaku.yamahata

On Fri, Jun 30, 2023 at 08:30:20PM +0200,
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Jun 30, 2023 at 09:55:32AM +0000, Huang, Kai wrote:
> > On Fri, 2023-06-30 at 11:26 +0200, Peter Zijlstra wrote:
> > > On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote:
> > > > On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote:
> > > > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > > > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> > > > > 
> > > > > I can't find a single caller of this.. why is this exported?
> > > > 
> > > > It's for the KVM TDX patches to use, which aren't in this series.
> > > > 
> > > > I'll remove the export.  The KVM TDX series can export it.
> > > 
> > > Fair enough; where will the KVM TDX series call this? Earlier there was
> > > talk about doing it at kvm module load time -- but I objected (and still
> > > do object) to that.
> > > 
> > > What's the current plan?
> > > 
> > 
> > The direction is still doing it during module load (not in this series anyway).
> > But this can be a separate discussion with the KVM maintainers involved.
> 
> They are all on Cc afaict.
> 
> > I understand your concern that you don't want to have memory & cpu time
> > wasted on enabling TDX by default.  For that we can have a kernel command line
> > option to disable TDX once and for all (we can even make that the default).
> 
> That's insane, I don't want to totally disable it. I want it done at
> guest creation. Do the whole TDX setup the moment you actually create a
> TDX guest.
> 
> Totally killing TDX is stupid, just about as stupid as doing it on
> module load (which equates to always doing it).
> 
> > Also, KVM will have a module parameter 'enable_tdx'.  I am hoping this could
> > reduce your concern too.
> 
> I don't get this obsession with doing it at module load time :/

The KVM maintainers prefer initialization at kvm_intel.ko load time. [1]
I can change the enable_tdx parameter for kvm_intel.ko to a string instead of a
boolean.  Something like

enable_tdx
        ondemand: on-demand initialization when creating the first TDX guest
        onload:   initialize TDX module when loading kvm_intel.ko
        disable:  disable TDX support
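
Parsing-wise, that could look roughly like this (sketch only; tdx_setup() is a
made-up placeholder, not the real API):

#include <linux/module.h>
#include <linux/string.h>

static int tdx_setup(void)
{
	/* placeholder: would do the actual TDX module initialization */
	return 0;
}

static char *enable_tdx = "ondemand";
module_param(enable_tdx, charp, 0444);

/* Called from the module's init path. */
static int __init enable_tdx_parse(void)
{
	if (!strcmp(enable_tdx, "disable"))
		return 0;		/* leave TDX support off entirely */
	if (!strcmp(enable_tdx, "onload"))
		return tdx_setup();	/* initialize at kvm_intel.ko load */
	if (!strcmp(enable_tdx, "ondemand"))
		return 0;		/* defer to first TDX guest creation */
	return -EINVAL;
}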
        

[1] https://lore.kernel.org/lkml/YkTvw5OXTTFf7j4y@google.com/
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-30 19:05             ` Isaku Yamahata
@ 2023-06-30 21:24               ` Sean Christopherson
  2023-06-30 21:58                 ` Dan Williams
                                   ` (3 more replies)
  0 siblings, 4 replies; 159+ messages in thread
From: Sean Christopherson @ 2023-06-30 21:24 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Peter Zijlstra, Kai Huang, kvm, Ashok Raj, Tony Luck, david,
	bagasdotme, Dave Hansen, ak, Rafael J Wysocki, kirill.shutemov,
	Reinette Chatre, pbonzini, mingo, tglx, linux-kernel, linux-mm,
	Isaku Yamahata, nik.borisov, hpa, Sagi Shahar, imammedo, bp,
	Chao Gao, Len Brown, sathyanarayanan.kuppuswamy, Ying Huang,
	Dan J Williams, x86

On Fri, Jun 30, 2023, Isaku Yamahata wrote:
> On Fri, Jun 30, 2023 at 08:30:20PM +0200,
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Jun 30, 2023 at 09:55:32AM +0000, Huang, Kai wrote:
> > > On Fri, 2023-06-30 at 11:26 +0200, Peter Zijlstra wrote:
> > > > On Thu, Jun 29, 2023 at 12:10:00AM +0000, Huang, Kai wrote:
> > > > > On Wed, 2023-06-28 at 15:17 +0200, Peter Zijlstra wrote:
> > > > > > On Tue, Jun 27, 2023 at 02:12:37AM +1200, Kai Huang wrote:
> > > > > > > +EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> > > > > > 
> > > > > > I can't find a single caller of this.. why is this exported?
> > > > > 
> > > > > It's for the KVM TDX patches to use, which aren't in this series.
> > > > > 
> > > > > I'll remove the export.  The KVM TDX series can export it.
> > > > 
> > > > Fair enough; where will the KVM TDX series call this? Earlier there was
> > > > talk about doing it at kvm module load time -- but I objected (and still
> > > > do object) to that.
> > > > 
> > > > What's the current plan?
> > > > 
> > > 
> > > The direction is still doing it during module load (not in this series anyway).
> > > But this can be a separate discussion with the KVM maintainers involved.
> > 
> > They are all on Cc afaict.
> > 
> > > I understand your concern that you don't want to have memory & cpu time
> > > wasted on enabling TDX by default.  For that we can have a kernel command line
> > > option to disable TDX once and for all (we can even make that the default).
> > 
> > That's insane, I don't want to totally disable it. I want it done at
> > guest creation. Do the whole TDX setup the moment you actually create a
> > TDX guest.
> > 
> > Totally killing TDX is stupid, 

I dunno about that, *totally* killing TDX would make my life a lot simpler ;-)

> > just about as stupid as doing it on module load (which equates to always
> > doing it).
> > 
> > > Also, KVM will have a module parameter 'enable_tdx'.  I am hoping this could
> > > reduce your concern too.
> > 
> > I don't get this obsession with doing it at module load time :/

Waiting until userspace attempts to create the first TDX guest adds complexity
and limits what KVM can do to harden itself.  Currently, all feature support in
KVM is effectively frozen at module load.  E.g. most of the setup code is
contained in __init functions, many module-scoped variables are effectively 
RO after init (though they can't be marked as such until we smush kvm-intel.ko
and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan).  All of
those patterns would get tossed aside if KVM waits until userspace attempts to
create the first guest.
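
E.g. the existing pattern looks roughly like this (a sketch, not actual KVM
code; tdx_enable() is from this series, the rest is illustrative and the
VMXON/tdx_cpu_enable() prerequisites are hand-waved):

#include <linux/module.h>

static bool enable_tdx;			/* effectively RO after init */
module_param(enable_tdx, bool, 0444);

static int __init vt_init(void)
{
	/* Feature support is decided here, once, at module load. */
	if (enable_tdx && tdx_enable())
		enable_tdx = false;

	return 0;
}
module_init(vt_init);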

The userspace experience would also be poor, as KVM can't know whether or not TDX is
actually supported until the TDX module is fully loaded and configured.  KVM waits
until VM creation to enable VMX, but that's pure enabling and more or less
guaranteed to succeed, e.g. will succeed barring hardware failures, software bugs,
or *severe* memory pressure.

There are also latency and noisy neighbor concerns, e.g. we *really* don't want
to end up in a situation where creating a TDX guest for a customer can observe
arbitrary latency *and* potentially be disruptive to VMs already running on the
host.

Userspace can work around the second and third issues by spawning a dummy TDX guest
as early as possible, but that adds complexity to userspace, especially if there's
any desire for it to be race free, e.g. with respect to reporting system capabilities
to the control plane.

On the flip side, limited hardware availability (unless Intel has changed its
tune) and the amount of enabling that's required in BIOS and whatnot makes it
highly unlikely that random Linux users are going to unknowingly boot with TDX
enabled.

That said, if this is a sticking point, let's just make enable_tdx off by default,
i.e. force userspace to opt-in.  Deployments that *know* they may want to schedule
TDX VMs on the host can simply force the module param.  And for everyone else,
since KVM is typically configured as a module by distros, KVM can be unloaded and
reloaded if the user realizes they want TDX well after the system is up and running.

> The KVM maintainers prefer initialization at kvm_intel.ko load time. [1]

You can say "Sean", I'm not the bogeyman :-)

> I can change the enable_tdx parameter for kvm_intel.ko to a string instead of
> a boolean.  Something like
> 
> enable_tdx
>         ondemand: on-demand initialization when creating the first TDX guest
>         onload:   initialize TDX module when loading kvm_intel.ko

No, that's the most complex path and makes no one happy.

>         disable:  disable TDX support

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-30 21:24               ` Sean Christopherson
@ 2023-06-30 21:58                 ` Dan Williams
  2023-06-30 23:13                 ` Dave Hansen
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 159+ messages in thread
From: Dan Williams @ 2023-06-30 21:58 UTC (permalink / raw)
  To: Sean Christopherson, Isaku Yamahata
  Cc: Peter Zijlstra, Kai Huang, kvm, Ashok Raj, Tony Luck, david,
	bagasdotme, Dave Hansen, ak, Rafael J Wysocki, kirill.shutemov,
	Reinette Chatre, pbonzini, mingo, tglx, linux-kernel, linux-mm,
	Isaku Yamahata, nik.borisov, hpa, Sagi Shahar, imammedo, bp,
	Chao Gao, Len Brown, sathyanarayanan.kuppuswamy, Ying Huang,
	Dan J Williams, x86

Sean Christopherson wrote:
> On Fri, Jun 30, 2023, Isaku Yamahata wrote:
> > On Fri, Jun 30, 2023 at 08:30:20PM +0200,
> > Peter Zijlstra <peterz@infradead.org> wrote:
[..]
> On the flip side, limited hardware availability (unless Intel has changed its
> tune) and the amount of enabling that's required in BIOS and whatnot makes it
> highly unlikely that random Linux users are going to unknowingly boot with TDX
> enabled.
> 
> That said, if this is a sticking point, let's just make enable_tdx off by default,
> i.e. force userspace to opt-in.  Deployments that *know* they may want to schedule
> TDX VMs on the host can simply force the module param.  And for everyone else,
> since KVM is typically configured as a module by distros, KVM can be unloaded and
> reloaded if the user realizes they want TDX well after the system is up and running.

Another potential option that also avoids the concern that module
parameters are unwieldy [1] is to have kvm_intel have a soft-dependency
on something like a kvm_intel_tdx module. That affords both a BIOS *and*
userspace policy opt-out where kvm_intel.ko can check that
kvm_intel_tdx.ko is present at init time, or proceed with TDX disabled.
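
Roughly (sketch; the module name and helper are hypothetical):

#include <linux/kmod.h>

/* kvm_intel init-time opt-in probe for a hypothetical kvm_intel_tdx.ko. */
static bool tdx_module_opted_in(void)
{
	/*
	 * request_module() returns 0 only if the module exists and loads;
	 * a missing or blacklisted kvm_intel_tdx.ko means we proceed with
	 * TDX disabled.
	 */
	return request_module("kvm_intel_tdx") == 0;
}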

[1]: http://lore.kernel.org/r/Y7z99mf1M5edxV4A@kroah.com

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-30 21:24               ` Sean Christopherson
  2023-06-30 21:58                 ` Dan Williams
@ 2023-06-30 23:13                 ` Dave Hansen
  2023-07-03 10:38                   ` Peter Zijlstra
  2023-07-03 10:49                 ` Peter Zijlstra
  2023-07-04 16:58                 ` Peter Zijlstra
  3 siblings, 1 reply; 159+ messages in thread
From: Dave Hansen @ 2023-06-30 23:13 UTC (permalink / raw)
  To: Sean Christopherson, Isaku Yamahata
  Cc: Peter Zijlstra, Kai Huang, kvm, Ashok Raj, Tony Luck, david,
	bagasdotme, ak, Rafael J Wysocki, kirill.shutemov,
	Reinette Chatre, pbonzini, mingo, tglx, linux-kernel, linux-mm,
	Isaku Yamahata, nik.borisov, hpa, Sagi Shahar, imammedo, bp,
	Chao Gao, Len Brown, sathyanarayanan.kuppuswamy, Ying Huang,
	Dan J Williams, x86

On 6/30/23 14:24, Sean Christopherson wrote:
> That said, if this is a sticking point, let's just make enable_tdx off by default,
> i.e. force userspace to opt-in.  Deployments that *know* they may want to schedule
> TDX VMs on the host can simply force the module param.  And for everyone else,
> since KVM is typically configured as a module by distros, KVM can be unloaded and
> reloaded if the user realizes they want TDX well after the system is up and running.

Let's just default it to off for now.

If we default it to on, we risk inflicting TDX on existing KVM users
that don't want it (by surprise).  If it turns out to be _that_ big of an
inconvenience, we'd have to reverse course and change the default from
on=>off.  *That* would break existing TDX users when we do it.  Gnashing
of teeth all around would ensue.

On the other hand, if we force TDX users to turn it on from day one, we
don't surprise _anyone_ that wasn't asking for it.  The only teeth
gnashing is for the TDX folks.

We could change _that_ down the line if the TDX users get too rowdy.
But I'd much rather err on the side of inconveniencing the guys that
know they want the snazzy new hardware than those who just want to run
plain old VMs.

I honestly don't care all that much either way.  There's an escape hatch
at runtime (reload kvm_intel.ko) no matter what we do.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-30 10:09                   ` Huang, Kai
  2023-06-30 18:42                     ` Isaku Yamahata
@ 2023-07-01  8:15                     ` Huang, Kai
  1 sibling, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-07-01  8:15 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Williams, Dan J, Raj, Ashok, Hansen, Dave, david,
	bagasdotme, ak, Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, Chatre, Reinette, pbonzini, mingo, Yamahata, Isaku,
	nik.borisov, tglx, Luck, Tony, kirill.shutemov, Shahar, Sagi,
	imammedo, hpa, Gao, Chao, bp, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, linux-mm, x86

On Fri, 2023-06-30 at 10:09 +0000, Huang, Kai wrote:
> On Fri, 2023-06-30 at 11:22 +0200, Peter Zijlstra wrote:
> > On Thu, Jun 29, 2023 at 12:15:13AM +0000, Huang, Kai wrote:
> > 
> > > > 	Can be called locally or through an IPI function call.
> > > > 
> > > 
> > > Thanks.  As in another reply, if using spinlock is OK, then I think we can say
> > > it will be called either locally or through an IPI function call.  Otherwise, we
> > > do via a new separate function tdx_global_init() and no lock is needed in that
> > > function.  The caller should call it properly.
> > 
> > IPI must use raw_spinlock_t. I'm ok with using raw_spinlock_t if there's
> > actual need for that, but the code as presented didn't -- in comments or
> > otherwise -- make it clear why it was as it was.
> 
> There's no hard requirement as I replied in another email.
> 
> Presumably you prefer the option to have a dedicated tdx_global_init() so we can
> avoid the raw_spinlock_t?
> 

Hmm... didn't have enough coffee.  Sorry, after more thinking, I think we need to
avoid tdx_global_init() and instead do TDH.SYS.INIT within tdx_cpu_enable() under
a raw_spinlock_t.  The reason is that although KVM will be the first caller of
TDX, there will be other callers of TDX in later phases (e.g., IOMMU TDX
support), so we need to consider races between those callers.

With multiple callers, their tdx_global_init() and tdx_cpu_enable() calls need
to be serialized anyway, and having the additional tdx_global_init() would just
make things more complicated.

So I think the simplest way is to use a per-cpu variable to track
TDH.SYS.LP.INIT in tdx_cpu_enable(), only call tdx_cpu_enable() either locally
with IRQs disabled or from an IPI function call, and use a raw_spinlock_t around
TDH.SYS.INIT inside tdx_cpu_enable() to make sure it only gets called once.

I'll clarify this in the changelog and/or comments.
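
Roughly like this (a rough sketch just to illustrate the locking; seamcall()
stands in for the real SEAMCALL wrapper and isn't the actual patch code):

#include <linux/percpu.h>
#include <linux/spinlock.h>

u64 seamcall(u64 leaf);		/* placeholder wrapper: 0 on success */
#define TDH_SYS_INIT	33	/* leaf numbers per the TDX module spec */
#define TDH_SYS_LP_INIT	35

static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);
static bool tdx_global_initialized;
static DEFINE_PER_CPU(bool, tdx_lp_initialized);

/* Called either locally with IRQs disabled or via IPI function call. */
int tdx_cpu_enable(void)
{
	WARN_ON_ONCE(!irqs_disabled());

	if (__this_cpu_read(tdx_lp_initialized))
		return 0;

	/*
	 * TDH.SYS.INIT is global.  Use a raw spinlock (IPI context) so
	 * it is done exactly once even when multiple callers race.
	 */
	raw_spin_lock(&tdx_global_init_lock);
	if (!tdx_global_initialized && !seamcall(TDH_SYS_INIT))
		tdx_global_initialized = true;
	raw_spin_unlock(&tdx_global_init_lock);

	if (!tdx_global_initialized)
		return -EIO;

	/* Per-cpu part: TDH.SYS.LP.INIT on this CPU only. */
	if (seamcall(TDH_SYS_LP_INIT))
		return -EIO;

	__this_cpu_write(tdx_lp_initialized, true);
	return 0;
}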

Again sorry for the noise, and please let me know if you have any comments.  Thanks!



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-30 15:16             ` Dave Hansen
@ 2023-07-01  8:16               ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-07-01  8:16 UTC (permalink / raw)
  To: peterz, Hansen, Dave
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, ak, Wysocki,
	Rafael J, linux-kernel, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, mingo, Yamahata, Isaku, kirill.shutemov, tglx,
	nik.borisov, linux-mm, hpa, Shahar, Sagi, imammedo, bp, Gao,
	Chao, Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying,
	Williams, Dan J, x86

On Fri, 2023-06-30 at 08:16 -0700, Dave Hansen wrote:
> On 6/30/23 03:18, Huang, Kai wrote:
> > > Please, because 12,14 are callee-saved, which means we need to go add
> > > push/pop to preserve them :-(
> > Yes.
> > 
> > However those new SEAMCALLs are for TDX guest live migration support, which is
> > a year (or more) away from upstreaming's point of view.  My thinking is we can
> > defer supporting those new SEAMCALLs until that phase.  Yes, we will need some
> > assembly changes at that time, but that also looks fine to me.
> > 
> > How does this sound?
> 
> It would sound better if the TDX module folks would take that year to
> fix the module and make it nicer for Linux. :)

Yeah agreed.  And we can push them to do so when we find something that needs
to be improved. :)

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-30 23:13                 ` Dave Hansen
@ 2023-07-03 10:38                   ` Peter Zijlstra
  0 siblings, 0 replies; 159+ messages in thread
From: Peter Zijlstra @ 2023-07-03 10:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Sean Christopherson, Isaku Yamahata, Kai Huang, kvm, Ashok Raj,
	Tony Luck, david, bagasdotme, ak, Rafael J Wysocki,
	kirill.shutemov, Reinette Chatre, pbonzini, mingo, tglx,
	linux-kernel, linux-mm, Isaku Yamahata, nik.borisov, hpa,
	Sagi Shahar, imammedo, bp, Chao Gao, Len Brown,
	sathyanarayanan.kuppuswamy, Ying Huang, Dan J Williams, x86

On Fri, Jun 30, 2023 at 04:13:39PM -0700, Dave Hansen wrote:

> I honestly don't care all that much either way.  There's an escape hatch
> at runtime (reload kvm_intel.ko) no matter what we do.

Please try with my MODULE=n kernel ;-) localyesconfig FTW.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-30 21:24               ` Sean Christopherson
  2023-06-30 21:58                 ` Dan Williams
  2023-06-30 23:13                 ` Dave Hansen
@ 2023-07-03 10:49                 ` Peter Zijlstra
  2023-07-03 14:40                   ` Dave Hansen
  2023-07-04 16:58                 ` Peter Zijlstra
  3 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-07-03 10:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Kai Huang, kvm, Ashok Raj, Tony Luck, david,
	bagasdotme, Dave Hansen, ak, Rafael J Wysocki, kirill.shutemov,
	Reinette Chatre, pbonzini, mingo, tglx, linux-kernel, linux-mm,
	Isaku Yamahata, nik.borisov, hpa, Sagi Shahar, imammedo, bp,
	Chao Gao, Len Brown, sathyanarayanan.kuppuswamy, Ying Huang,
	Dan J Williams, x86

On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote:

> I dunno about that, *totally* killing TDX would make my life a lot simpler ;-)

:-)

> > > I don't get this obsession with doing it at module load time :/
> 
> Waiting until userspace attempts to create the first TDX guest adds complexity
> and limits what KVM can do to harden itself.  Currently, all feature support in
> KVM is effectively frozen at module load.  E.g. most of the setup code is
> contained in __init functions, many module-scoped variables are effectively 
> RO after init (though they can't be marked as such until we smush kvm-intel.ko
> and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan).  All of
> those patterns would get tossed aside if KVM waits until userspace attempts to
> create the first guest.

Pff, all that is perfectly possible, just a wee bit more work :-) I
mean, we manage to poke text that's RO, surely we can poke a variable
that's supposedly RO.

And I really wish we could put part of the kvm-intel/amd.ko things in
the kernel proper and reduce the EXPORT_SYMBOL surface -- we're
exporting a whole bunch of things that really shouldn't be, just for KVM
:/

> The userspace experience would also be poor, as KVM can't know whether or not TDX is
> actually supported until the TDX module is fully loaded and configured.

Quality that :-(

> There are also latency and noisy neighbor concerns, e.g. we *really* don't want
> to end up in a situation where creating a TDX guest for a customer can observe
> arbitrary latency *and* potentially be disruptive to VMs already running on the
> host.

Well, that's a quality of implementation issue with the whole TDX
crapola. Sounds like we want to impose latency constraints on the
various TDX calls. Allowing it to consume arbitrary amounts of CPU time
is unacceptable in any case.

> Userspace can work around the second and third issues by spawning a dummy TDX guest
> as early as possible, but that adds complexity to userspace, especially if there's
> any desire for it to be race free, e.g. with respect to reporting system capabilities
> to the control plane.

FWIW, I'm 100% behind pushing complexity into userspace if it makes for
a simpler kernel.

> On the flip side, limited hardware availability (unless Intel has changed its
> tune) and the amount of enabling that's required in BIOS and whatnot makes it
> highly unlikely that random Linux users are going to unknowingly boot with TDX
> enabled.
> 
> That said, if this is a sticking point, let's just make enable_tdx off by default,

OK.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-06-30 12:06             ` Peter Zijlstra
  2023-06-30 15:14               ` Peter Zijlstra
@ 2023-07-03 12:15               ` Huang, Kai
  2023-07-05 10:21                 ` Peter Zijlstra
  1 sibling, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-07-03 12:15 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen, Dave, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, tglx, linux-kernel, linux-mm, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86


> 
> So I think the below deals with everything and unifies __tdx_hypercall()
> and __tdx_module_call(), since both sides need to deal with exactly the
> same trainwreck.

Hi Peter,

Just want to make sure I understand you correctly:

You want to make __tdx_module_call() look like __tdx_hypercall(), but not
unify them into a single assembly function (at least for now), right?

I am confused that you mentioned VP.VMCALL below, which is handled by
__tdx_hypercall().

> 
> 
> /*
>  * Used for input/output registers values of the TDCALL and SEAMCALL
>  * instructions when requesting services from the TDX module.
>  *
>  * This is a software only structure and not part of the TDX module/VMM ABI.
>  */
> struct tdx_module_args {
> 	/* callee-clobbered */
> 	u64 rdx;
> 	u64 rcx;
> 	u64 r8;
> 	u64 r9;
> 	/* extra callee-clobbered */
> 	u64 r10;
> 	u64 r11;
> 	/* callee-saved + rdi/rsi */
> 	u64 rdi;
> 	u64 rsi;
> 	u64 rbx;
> 	u64 r12;
> 	u64 r13;
> 	u64 r14;
> 	u64 r15;
> };
> 
> 
> 
> /*
>  * TDX_MODULE_CALL - common helper macro for both
>  *                   TDCALL and SEAMCALL instructions.
>  *
>  * TDCALL   - used by TDX guests to make requests to the
>  *            TDX module and hypercalls to the VMM.
>  *
>  * SEAMCALL - used by TDX hosts to make requests to the
>  *            TDX module.
>  *
>  *-------------------------------------------------------------------------
>  * TDCALL/SEAMCALL ABI:
>  *-------------------------------------------------------------------------
>  * Input Registers:
>  *
>  * RAX                 - Leaf number.
>  * RCX,RDX,R8-R11      - Leaf specific input registers.
>  * RDI,RSI,RBX,R12-R15 - VP.VMCALL VP.ENTER
>  *
>  * Output Registers:
>  *
>  * RAX                 - instruction error code.
>  * RCX,RDX,R8-R11      - Leaf specific output registers.
>  * RDI,RSI,RBX,R12-R15 - VP.VMCALL VP.ENTER

As mentioned above, VP.VMCALL is handled by __tdx_hypercall().  Also, VP.ENTER
will be handled by KVM's own assembly.  Neither of them is handled by this
TDX_MODULE_CALL assembly.

>  *
>  *-------------------------------------------------------------------------
>  *
>  * So while the common core (RAX,RCX,RDX,R8-R11) fits nicely in the
>  * callee-clobbered registers and even leaves RDI,RSI free to act as a base
>  * pointer some rare leafs (VP.VMCALL, VP.ENTER) make a giant mess of things.
>  *
>  * For simplicity, assume that anything that needs the callee-saved regs also
>  * tramples on RDI,RSI. This isn't strictly true, see for example EXPORT.MEM.
>  */
> .macro TDX_MODULE_CALL host:req ret:req saved:0
> 	FRAME_BEGIN
> 
> 	movq	%rdi, %rax
> 
> 	movq	TDX_MODULE_rcx(%rsi), %rcx
> 	movq	TDX_MODULE_rdx(%rsi), %rdx
> 	movq	TDX_MODULE_r8(%rsi),  %r8
> 	movq	TDX_MODULE_r9(%rsi),  %r9
> 	movq	TDX_MODULE_r10(%rsi), %r10
> 	movq	TDX_MODULE_r11(%rsi), %r11
> 
> .if \saved
> 	pushq	%rbx
> 	pushq	%r12
> 	pushq	%r13
> 	pushq	%r14
> 	pushq	%r15
> 
> 	movq	TDX_MODULE_rbx(%rsi), %rbx
> 	movq	TDX_MODULE_r12(%rsi), %r12
> 	movq	TDX_MODULE_r13(%rsi), %r13
> 	movq	TDX_MODULE_r14(%rsi), %r14
> 	movq	TDX_MODULE_r15(%rsi), %r15
> 
> 	/* VP.VMCALL and VP.ENTER */
> .if \ret
> 	pushq	%rsi
> .endif
> 	movq	TDX_MODULE_rdi(%rsi), %rdi
> 	movq	TDX_MODULE_rsi(%rsi), %rsi
> .endif
> 
> .Lcall:
> .if \host
> 	seamcall
> 	/*
> 	 * SEAMCALL instruction is essentially a VMExit from VMX root
> 	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
> 	 * that the targeted SEAM firmware is not loaded or disabled,
> 	 * or P-SEAMLDR is busy with another SEAMCALL. RAX is not
> 	 * changed in this case.
> 	 */
> 	jc	.Lseamfail
> 
> .if \saved && \ret
> 	/*
> 	 * VP.ENTER clears RSI on output, use it to restore state.
> 	 */
> 	popq	%rsi
> 	xor	%edi,%edi
> 	movq	%rdi, TDX_MODULE_rdi(%rsi)
> 	movq	%rdi, TDX_MODULE_rsi(%rsi)
> .endif
> .else
> 	tdcall
> 
> 	/*
> 	 * RAX!=0 indicates a failure, assume no return values.
> 	 */
> 	testq	%rax, %rax
> 	jne	.Lerror

For some SEAMCALLs/TDCALLs the output registers may contain additional error
information.  We need to jump to a location where copying those additional regs
back to 'struct tdx_module_args' still depends on \ret.

> 
> .if \saved && \ret
> 	/*
> 	 * Since RAX==0, it can be used as a scratch register to restore state.
> 	 *
> 	 * [ assumes \saved implies \ret ]
> 	 */
> 	popq	%rax
> 	movq	%rdi, TDX_MODULE_rdi(%rax)
> 	movq	%rsi, TDX_MODULE_rsi(%rax)
> 	movq	%rax, %rsi
> 	xor	%eax, %eax;
> .endif
> .endif // \host
> 
> .if \ret
> 	/* RSI is restored */
> 	movq	%rcx, TDX_MODULE_rcx(%rsi)
> 	movq	%rdx, TDX_MODULE_rdx(%rsi)
> 	movq	%r8,  TDX_MODULE_r8(%rsi)
> 	movq	%r9,  TDX_MODULE_r9(%rsi)
> 	movq	%r10, TDX_MODULE_r10(%rsi)
> 	movq	%r11, TDX_MODULE_r11(%rsi)
> .if \saved
> 	movq	%rbx, TDX_MODULE_rbx(%rsi)
> 	movq	%r12, TDX_MODULE_r12(%rsi)
> 	movq	%r13, TDX_MODULE_r13(%rsi)
> 	movq	%r14, TDX_MODULE_r14(%rsi)
> 	movq	%r15, TDX_MODULE_r15(%rsi)
> .endif
> .endif // \ret
> 
> .Lout:
> .if \saved
> 	popq	%r15
> 	popq	%r14
> 	popq	%r13
> 	popq	%r12
> 	popq	%rbx
> .endif
> 	FRAME_END
> 	RET
> 
> 	/*
> 	 * Error and exception handling at .Lcall. Ignore \ret on failure.
> 	 */
> .Lerror:
> .if \saved && \ret
> 	popq	%rsi
> .endif
> 	jmp	.Lout
> 
> .if \host
> .Lseamfail:
> 	/*
> 	 * Set RAX to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
> 	 * This value will never be used as actual SEAMCALL error code as
> 	 * it is from the Reserved status code class.
> 	 */
> 	movq	$TDX_SEAMCALL_VMFAILINVALID, %rax
> 	jmp	.Lerror
> 
> .Lfault:
> 	/*
> 	 * SEAMCALL caused #GP or #UD. Per _ASM_EXTABLE_FAULT() RAX
> 	 * contains the trap number, convert to a TDX error code by
> 	 * setting the high word to TDX_SW_ERROR.
> 	 */
> 	mov	$TDX_SW_ERROR, %rdi
> 	or	%rdi, %rax
> 	jmp	.Lerror
> 
> 	_ASM_EXTABLE_FAULT(.Lcall, .Lfault)
> .endif
> .endm


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-07-03 10:49                 ` Peter Zijlstra
@ 2023-07-03 14:40                   ` Dave Hansen
  2023-07-03 15:03                     ` Peter Zijlstra
  0 siblings, 1 reply; 159+ messages in thread
From: Dave Hansen @ 2023-07-03 14:40 UTC (permalink / raw)
  To: Peter Zijlstra, Sean Christopherson
  Cc: Isaku Yamahata, Kai Huang, kvm, Ashok Raj, Tony Luck, david,
	bagasdotme, ak, Rafael J Wysocki, kirill.shutemov,
	Reinette Chatre, pbonzini, mingo, tglx, linux-kernel, linux-mm,
	Isaku Yamahata, nik.borisov, hpa, Sagi Shahar, imammedo, bp,
	Chao Gao, Len Brown, sathyanarayanan.kuppuswamy, Ying Huang,
	Dan J Williams, x86

On 7/3/23 03:49, Peter Zijlstra wrote:
>> There are also latency and noisy neighbor concerns, e.g. we *really* don't want
>> to end up in a situation where creating a TDX guest for a customer can observe
>> arbitrary latency *and* potentially be disruptive to VMs already running on the
>> host.
> Well, that's a quality of implementation issue with the whole TDX
> crapola. Sounds like we want to impose latency constraints on the
> various TDX calls. Allowing it to consume arbitrary amounts of CPU time
> is unacceptable in any case.

For what it's worth, everybody knew that calling into the TDX module was
going to be a black hole and that consuming large amounts of CPU at
random times would drive people bat guano crazy.

The TDX Module ABI spec does have "Leaf Function Latency" warnings for
some of the module calls.  But, it's basically a binary thing.  A call
is either normal or "longer than most".

The majority of the "longer than most" cases are for initialization.
The _most_ obscene runtime ones are chunked up and can return partial
progress to limit latency spikes.  But I don't think folks tried as hard
on the initialization calls since they're only called once which
actually seems pretty reasonable to me.

Maybe we need three classes of "Leaf Function Latency":
1. Sane
2. "Longer than most"
3. Better turn the NMI watchdog off before calling this. :)

Would that help?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-07-03 14:40                   ` Dave Hansen
@ 2023-07-03 15:03                     ` Peter Zijlstra
  2023-07-03 15:26                       ` Dave Hansen
  2023-07-03 17:55                       ` kirill.shutemov
  0 siblings, 2 replies; 159+ messages in thread
From: Peter Zijlstra @ 2023-07-03 15:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Sean Christopherson, Isaku Yamahata, Kai Huang, kvm, Ashok Raj,
	Tony Luck, david, bagasdotme, ak, Rafael J Wysocki,
	kirill.shutemov, Reinette Chatre, pbonzini, mingo, tglx,
	linux-kernel, linux-mm, Isaku Yamahata, nik.borisov, hpa,
	Sagi Shahar, imammedo, bp, Chao Gao, Len Brown,
	sathyanarayanan.kuppuswamy, Ying Huang, Dan J Williams, x86

On Mon, Jul 03, 2023 at 07:40:55AM -0700, Dave Hansen wrote:
> On 7/3/23 03:49, Peter Zijlstra wrote:
> >> There are also latency and noisy neighbor concerns, e.g. we *really* don't want
> >> to end up in a situation where creating a TDX guest for a customer can observe
> >> arbitrary latency *and* potentially be disruptive to VMs already running on the
> >> host.
> > Well, that's a quality of implementation issue with the whole TDX
> > crapola. Sounds like we want to impose latency constraints on the
> > various TDX calls. Allowing it to consume arbitrary amounts of CPU time
> > is unacceptable in any case.
> 
> For what it's worth, everybody knew that calling into the TDX module was
> going to be a black hole and that consuming large amounts of CPU at
> random times would drive people bat guano crazy.
> 
> The TDX Module ABI spec does have "Leaf Function Latency" warnings for
> some of the module calls.  But, it's basically a binary thing.  A call
> is either normal or "longer than most".
> 
> The majority of the "longer than most" cases are for initialization.
> The _most_ obscene runtime ones are chunked up and can return partial
> progress to limit latency spikes.  But I don't think folks tried as hard
> on the initialization calls since they're only called once which
> actually seems pretty reasonable to me.
> 
> Maybe we need three classes of "Leaf Function Latency":
> 1. Sane
> 2. "Longer than most"
> 3. Better turn the NMI watchdog off before calling this. :)
> 
> Would that help?

I'm thinking we want something along the lines of the Xen preemptible
hypercalls, except less crazy. Where the caller does:

	for (;;) {
		ret = tdcall(fn, args);
		if (ret == -EAGAIN) {
			cond_resched();
			continue;
		}
		break;
	}

And then the TDX black box provides a guarantee that any one tdcall (or
seamcall or whatever) never takes more than X ns (possibly even
configurable) and we get to raise a bug report if we can prove it
actually takes longer.

Handing the CPU off to random code for a random period of time is just not
a good idea, ever.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-07-03 15:03                     ` Peter Zijlstra
@ 2023-07-03 15:26                       ` Dave Hansen
  2023-07-03 17:55                       ` kirill.shutemov
  1 sibling, 0 replies; 159+ messages in thread
From: Dave Hansen @ 2023-07-03 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sean Christopherson, Isaku Yamahata, Kai Huang, kvm, Ashok Raj,
	Tony Luck, david, bagasdotme, ak, Rafael J Wysocki,
	kirill.shutemov, Reinette Chatre, pbonzini, mingo, tglx,
	linux-kernel, linux-mm, Isaku Yamahata, nik.borisov, hpa,
	Sagi Shahar, imammedo, bp, Chao Gao, Len Brown,
	sathyanarayanan.kuppuswamy, Ying Huang, Dan J Williams, x86

On 7/3/23 08:03, Peter Zijlstra wrote:
> On Mon, Jul 03, 2023 at 07:40:55AM -0700, Dave Hansen wrote:
>> On 7/3/23 03:49, Peter Zijlstra wrote:
>>>> There are also latency and noisy neighbor concerns, e.g. we *really* don't want
>>>> to end up in a situation where creating a TDX guest for a customer can observe
>>>> arbitrary latency *and* potentially be disruptive to VMs already running on the
>>>> host.
>>> Well, that's a quality of implementation issue with the whole TDX
>>> crapola. Sounds like we want to impose latency constraints on the
>>> various TDX calls. Allowing it to consume arbitrary amounts of CPU time
>>> is unacceptable in any case.
>>
>> For what it's worth, everybody knew that calling into the TDX module was
>> going to be a black hole and that consuming large amounts of CPU at
>> random times would drive people bat guano crazy.
>>
>> The TDX Module ABI spec does have "Leaf Function Latency" warnings for
>> some of the module calls.  But, it's basically a binary thing.  A call
>> is either normal or "longer than most".
>>
>> The majority of the "longer than most" cases are for initialization.
>> The _most_ obscene runtime ones are chunked up and can return partial
>> progress to limit latency spikes.  But I don't think folks tried as hard
>> on the initialization calls since they're only called once, which
>> actually seems pretty reasonable to me.
>>
>> Maybe we need three classes of "Leaf Function Latency":
>> 1. Sane
>> 2. "Longer than most"
>> 3. Better turn the NMI watchdog off before calling this. :)
>>
>> Would that help?
> 
> I'm thinking we want something along the lines of the Xen preemptible
> hypercalls, except less crazy, where the caller does:
> 
> 	for (;;) {
> 		ret = tdcall(fn, args);
> 		if (ret == -EAGAIN) {
> 			cond_resched();
> 			continue;
> 		}
> 		break;
> 	}
> 
> And then the TDX black box provides a guarantee that any one tdcall (or
> seamcall or whatever) never takes more than X ns (possibly even
> configurable) and we get to raise a bug report if we can prove it
> actually takes longer.

It's _supposed_ to be doing something kinda like that.  For instance, in
the places that need locking, the TDX module essentially does:

	if (!trylock(&lock))
		return -EBUSY;

which is a heck of a lot better than spinning in the TDX module.  Those
module locks are also almost always for things that *also* have some
kind of concurrency control in Linux too.

*But*, there are also the really nasty calls that *do* take forever.  It
would be great to have a list of them or, heck, even *enumeration* of
which ones can take forever so we don't need to maintain a table.
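
On the Linux side, a sketch of coping with those trylock-style exits
(the -EBUSY mapping is from the pseudocode above; the mutex and the
retry bound are assumptions, not what the series actually does):

	static DEFINE_MUTEX(tdx_module_lock);

	/*
	 * Serialize callers on a Linux mutex so the module's internal
	 * trylock rarely contends, and retry a bounded number of times
	 * if the module still reports busy.
	 */
	static int seamcall_serialized(u64 fn, u64 rcx, u64 rdx, u64 r8,
				       u64 r9, struct tdx_module_output *out)
	{
		int retries = 10;
		int ret;

		mutex_lock(&tdx_module_lock);
		do {
			ret = seamcall(fn, rcx, rdx, r8, r9, NULL, out);
		} while (ret == -EBUSY && --retries);
		mutex_unlock(&tdx_module_lock);

		return ret;
	}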

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-07-03 15:03                     ` Peter Zijlstra
  2023-07-03 15:26                       ` Dave Hansen
@ 2023-07-03 17:55                       ` kirill.shutemov
  2023-07-03 18:26                         ` Dave Hansen
  2023-07-05  7:14                         ` Peter Zijlstra
  1 sibling, 2 replies; 159+ messages in thread
From: kirill.shutemov @ 2023-07-03 17:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Sean Christopherson, Isaku Yamahata, Kai Huang, kvm,
	Ashok Raj, Tony Luck, david, bagasdotme, ak, Rafael J Wysocki,
	Reinette Chatre, pbonzini, mingo, tglx, linux-kernel, linux-mm,
	Isaku Yamahata, nik.borisov, hpa, Sagi Shahar, imammedo, bp,
	Chao Gao, Len Brown, sathyanarayanan.kuppuswamy, Ying Huang,
	Dan J Williams, x86

On Mon, Jul 03, 2023 at 05:03:30PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 03, 2023 at 07:40:55AM -0700, Dave Hansen wrote:
> > On 7/3/23 03:49, Peter Zijlstra wrote:
> > >> There are also latency and noisy neighbor concerns, e.g. we *really* don't want
> > >> to end up in a situation where creating a TDX guest for a customer can observe
> > >> arbitrary latency *and* potentially be disruptive to VMs already running on the
> > >> host.
> > > Well, that's a quality of implementation issue with the whole TDX
> > > crapola. Sounds like we want to impose latency constraints on the
> > > various TDX calls. Allowing it to consume arbitrary amounts of CPU time
> > > is unacceptable in any case.
> > 
> > For what it's worth, everybody knew that calling into the TDX module was
> > going to be a black hole and that consuming large amounts of CPU at
> > random times would drive people bat guano crazy.
> > 
> > The TDX Module ABI spec does have "Leaf Function Latency" warnings for
> > some of the module calls.  But, it's basically a binary thing.  A call
> > is either normal or "longer than most".
> > 
> > The majority of the "longer than most" cases are for initialization.
> > The _most_ obscene runtime ones are chunked up and can return partial
> > progress to limit latency spikes.  But I don't think folks tried as hard
> > on the initialization calls since they're only called once, which
> > actually seems pretty reasonable to me.
> > 
> > Maybe we need three classes of "Leaf Function Latency":
> > 1. Sane
> > 2. "Longer than most"
> > 3. Better turn the NMI watchdog off before calling this. :)
> > 
> > Would that help?
> 
> I'm thinking we want something along the lines of the Xen preemptible
> hypercalls, except less crazy, where the caller does:
> 
> 	for (;;) {
> 		ret = tdcall(fn, args);
> 		if (ret == -EAGAIN) {
> 			cond_resched();
> 			continue;
> 		}
> 		break;
> 	}
> 
> And then the TDX black box provides a guarantee that any one tdcall (or
> seamcall or whatever) never takes more than X ns (possibly even
> configurable) and we get to raise a bug report if we can prove it
> actually takes longer.

TDG.VP.VMCALL TDCALL can take an arbitrary amount of time as it hands over
control to the host/VMM.

But I don't quite follow how it is different from the host stopping
scheduling of a vCPU on a random instruction. That can happen at any point
and TDCALL is not special from this PoV.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-07-03 17:55                       ` kirill.shutemov
@ 2023-07-03 18:26                         ` Dave Hansen
  2023-07-05  7:14                         ` Peter Zijlstra
  1 sibling, 0 replies; 159+ messages in thread
From: Dave Hansen @ 2023-07-03 18:26 UTC (permalink / raw)
  To: kirill.shutemov, Peter Zijlstra
  Cc: Sean Christopherson, Isaku Yamahata, Kai Huang, kvm, Ashok Raj,
	Tony Luck, david, bagasdotme, ak, Rafael J Wysocki,
	Reinette Chatre, pbonzini, mingo, tglx, linux-kernel, linux-mm,
	Isaku Yamahata, nik.borisov, hpa, Sagi Shahar, imammedo, bp,
	Chao Gao, Len Brown, sathyanarayanan.kuppuswamy, Ying Huang,
	Dan J Williams, x86

On 7/3/23 10:55, kirill.shutemov@linux.intel.com wrote:
>> I'm thinking we want something along the lines of the Xen preemptible
>> hypercalls, except less crazy, where the caller does:
>>
>> 	for (;;) {
>> 		ret = tdcall(fn, args);
>> 		if (ret == -EAGAIN) {
>> 			cond_resched();
>> 			continue;
>> 		}
>> 		break;
>> 	}
>>
>> And then the TDX black box provides a guarantee that any one tdcall (or
>> seamcall or whatever) never takes more than X ns (possibly even
>> configurable) and we get to raise a bug report if we can prove it
>> actually takes longer.
> TDG.VP.VMCALL TDCALL can take an arbitrary amount of time as it hands over
> control to the host/VMM.
> 
> But I don't quite follow how it is different from the host stopping
> scheduling of a vCPU on a random instruction. That can happen at any point
> and TDCALL is not special from this PoV.

Well, for one, if the host stops the vCPU on a random instruction the
host has to restore all the vCPU state.  *ALL* of it.  That means that
after the host hands control back, the guest is perfectly ready to take
all the interrupts that are pending.

These TDCALLs are *VERY* different.  The guest gets control back and has
some amount of its state zapped, RBP being the most annoying current
example of state that is lost.  So the guest resumes control here and
must handle all of its interrupts with some of its state (and thus
ability to cleanly handle the interrupt) gone.

The instructions after state is lost are very much special.  Just look
at the syscall gap.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 11/22] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
  2023-06-26 14:12 ` [PATCH v12 11/22] x86/virt/tdx: Fill out " Kai Huang
@ 2023-07-04  7:28   ` Yuan Yao
  0 siblings, 0 replies; 159+ messages in thread
From: Yuan Yao @ 2023-07-04  7:28 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, peterz, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:41AM +1200, Kai Huang wrote:
> Start working through the "multi-steps" needed to construct a list of
> "TD Memory Regions" (TDMRs) to cover all TDX-usable memory regions.
>
> The kernel configures TDX-usable memory regions by passing a list of
> "TD Memory Regions" (TDMRs) to the TDX module.  Each TDMR contains
> the information of the base/size of a memory region, the base/size of the
> associated Physical Address Metadata Table (PAMT) and a list of reserved
> areas in the region.
>
> Do the first step to fill out a number of TDMRs to cover all TDX memory
> regions.  To keep it simple, always try to use one TDMR for each memory
> region.  As the first step only set up the base/size for each TDMR.
>
> Each TDMR must be 1G aligned and the size must be in 1G granularity.
> This implies that one TDMR could cover multiple memory regions.  If a
> memory region spans the 1GB boundary and the former part is already
> covered by the previous TDMR, just use a new TDMR for the remaining
> part.
>
> TDX only supports a limited number of TDMRs.  Disable TDX if all TDMRs
> are consumed but there is more memory region to cover.
>
> There are fancier things that could be done like trying to merge
> adjacent TDMRs.  This would allow more pathological memory layouts to be
> supported.  But, current systems are not even close to exhausting the
> existing TDMR resources in practice.  For now, keep it simple.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---

Reviewed-by: Yuan Yao <yuan.yao@intel.com>

>
> v11 -> v12:
>  - Improved comments around looping over TDX memblock to create TDMRs.
>    (Dave).
>  - Added code to pr_warn() when consumed TDMRs reaching maximum TDMRs
>    (Dave).
>  - BIT_ULL(30) -> SZ_1G (Kirill)
>  - Removed unused TDMR_PFN_ALIGNMENT (Sathy)
>  - Added tags from Kirill/Sathy
>
> v10 -> v11:
>  - No update
>
> v9 -> v10:
>  - No change.
>
> v8 -> v9:
>
>  - Added the last paragraph in the changelog (Dave).
>  - Removed unnecessary type cast in tdmr_entry() (Dave).
>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 103 +++++++++++++++++++++++++++++++++++-
>  arch/x86/virt/vmx/tdx/tdx.h |   3 ++
>  2 files changed, 105 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index e28615b60f9b..2ffc1517a93b 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -341,6 +341,102 @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list)
>  			tdmr_list->max_tdmrs * tdmr_list->tdmr_sz);
>  }
>
> +/* Get the TDMR from the list at the given index. */
> +static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list,
> +				    int idx)
> +{
> +	int tdmr_info_offset = tdmr_list->tdmr_sz * idx;
> +
> +	return (void *)tdmr_list->tdmrs + tdmr_info_offset;
> +}
> +
> +#define TDMR_ALIGNMENT		SZ_1G
> +#define TDMR_ALIGN_DOWN(_addr)	ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
> +#define TDMR_ALIGN_UP(_addr)	ALIGN((_addr), TDMR_ALIGNMENT)
> +
> +static inline u64 tdmr_end(struct tdmr_info *tdmr)
> +{
> +	return tdmr->base + tdmr->size;
> +}
> +
> +/*
> + * Take the memory referenced in @tmb_list and populate the
> + * preallocated @tdmr_list, following all the special alignment
> + * and size rules for TDMR.
> + */
> +static int fill_out_tdmrs(struct list_head *tmb_list,
> +			  struct tdmr_info_list *tdmr_list)
> +{
> +	struct tdx_memblock *tmb;
> +	int tdmr_idx = 0;
> +
> +	/*
> +	 * Loop over TDX memory regions and fill out TDMRs to cover them.
> +	 * To keep it simple, always try to use one TDMR to cover one
> +	 * memory region.
> +	 *
> +	 * In practice TDX supports at least 64 TDMRs.  A 2-socket system
> +	 * typically only consumes less than 10 of those.  This code is
> +	 * dumb and simple and may use more TMDRs than is strictly
> +	 * dumb and simple and may use more TDMRs than is strictly
> +	 */
> +	list_for_each_entry(tmb, tmb_list, list) {
> +		struct tdmr_info *tdmr = tdmr_entry(tdmr_list, tdmr_idx);
> +		u64 start, end;
> +
> +		start = TDMR_ALIGN_DOWN(PFN_PHYS(tmb->start_pfn));
> +		end   = TDMR_ALIGN_UP(PFN_PHYS(tmb->end_pfn));
> +
> +		/*
> +		 * A valid size indicates the current TDMR has already
> +		 * been filled out to cover the previous memory region(s).
> +		 */
> +		if (tdmr->size) {
> +			/*
> +			 * Loop to the next if the current memory region
> +			 * has already been fully covered.
> +			 */
> +			if (end <= tdmr_end(tdmr))
> +				continue;
> +
> +			/* Otherwise, skip the already covered part. */
> +			if (start < tdmr_end(tdmr))
> +				start = tdmr_end(tdmr);
> +
> +			/*
> +			 * Create a new TDMR to cover the current memory
> +			 * region, or the remaining part of it.
> +			 */
> +			tdmr_idx++;
> +			if (tdmr_idx >= tdmr_list->max_tdmrs) {
> +				pr_warn("initialization failed: TDMRs exhausted.\n");
> +				return -ENOSPC;
> +			}
> +
> +			tdmr = tdmr_entry(tdmr_list, tdmr_idx);
> +		}
> +
> +		tdmr->base = start;
> +		tdmr->size = end - start;
> +	}
> +
> +	/* @tdmr_idx is always the index of the last valid TDMR. */
> +	tdmr_list->nr_consumed_tdmrs = tdmr_idx + 1;
> +
> +	/*
> +	 * Warn early that the kernel is about to run out of TDMRs.
> +	 *
> +	 * This is an indication that TDMR allocation has to be
> +	 * reworked to be smarter to not run into an issue.
> +	 */
> +	if (tdmr_list->max_tdmrs - tdmr_list->nr_consumed_tdmrs < TDMR_NR_WARN)
> +		pr_warn("consumed TDMRs reaching limit: %d used out of %d\n",
> +				tdmr_list->nr_consumed_tdmrs,
> +				tdmr_list->max_tdmrs);
> +
> +	return 0;
> +}
> +
>  /*
>   * Construct a list of TDMRs on the preallocated space in @tdmr_list
>   * to cover all TDX memory regions in @tmb_list based on the TDX module
> @@ -350,10 +446,15 @@ static int construct_tdmrs(struct list_head *tmb_list,
>  			   struct tdmr_info_list *tdmr_list,
>  			   struct tdsysinfo_struct *sysinfo)
>  {
> +	int ret;
> +
> +	ret = fill_out_tdmrs(tmb_list, tdmr_list);
> +	if (ret)
> +		return ret;
> +
>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Fill out TDMRs to cover all TDX memory regions.
>  	 *  - Allocate and set up PAMTs for each TDMR.
>  	 *  - Designate reserved areas for each TDMR.
>  	 *
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 193764afc602..3086f7ad0522 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -123,6 +123,9 @@ struct tdx_memblock {
>  	unsigned long end_pfn;
>  };
>
> +/* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */
> +#define TDMR_NR_WARN 4
> +
>  struct tdmr_info_list {
>  	void *tdmrs;	/* Flexible array to hold 'tdmr_info's */
>  	int nr_consumed_tdmrs;	/* How many 'tdmr_info's are in use */
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2023-06-26 14:12 ` [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
  2023-06-27  9:51   ` kirill.shutemov
@ 2023-07-04  7:40   ` Yuan Yao
  2023-07-04  8:59     ` Huang, Kai
  2023-07-11 11:42   ` David Hildenbrand
  2 siblings, 1 reply; 159+ messages in thread
From: Yuan Yao @ 2023-07-04  7:40 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, peterz, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:42AM +1200, Kai Huang wrote:
> The TDX module uses additional metadata to record things like which
> guest "owns" a given page of memory.  This metadata, referred as
> Physical Address Metadata Table (PAMT), essentially serves as the
> 'struct page' for the TDX module.  PAMTs are not reserved by hardware
> up front.  They must be allocated by the kernel and then given to the
> TDX module during module initialization.
>
> TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
> (TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
> be a physically contiguous area from a Convertible Memory Region (CMR).
> However, the PAMTs which track pages in one TDMR do not need to reside
> within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
> any TDMR, the overlapping part must be reported as a reserved area in
> that particular TDMR.
>
> Use alloc_contig_pages() since PAMT must be a physically contiguous area
> and it may be potentially large (~1/256th of the size of the given TDMR).
> The downside is alloc_contig_pages() may fail at runtime.  One (bad)
> mitigation is to launch a TDX guest early during system boot to get
> those PAMTs allocated early, but the real fix is to add a
> boot option to allocate or reserve PAMTs during kernel boot.
>
> It is imperfect but will be improved on later.
>
> TDX only supports a limited number of reserved areas per TDMR to cover
> both PAMTs and memory holes within the given TDMR.  If many PAMTs are
> allocated within a single TDMR, the reserved areas may not be sufficient
> to cover all of them.
>
> Adopt the following policies when allocating PAMTs for a given TDMR:
>
>   - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
>     the total number of reserved areas consumed for PAMTs.
>   - Try to first allocate PAMT from the local node of the TDMR for better
>     NUMA locality.
>
> Also dump out how many pages are allocated for PAMTs when the TDX module
> is initialized successfully.  This helps answer the eternal "where did
> all my memory go?" questions.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
>
> v11 -> v12:
>  - Moved TDX_PS_NUM from tdx.c to <asm/tdx.h> (Kirill)
>  - "<= TDX_PS_1G" -> "< TDX_PS_NUM" (Kirill)
>  - Changed tdmr_get_pamt() to return base and size instead of base_pfn
>    and npages and related code directly (Dave).
>  - Simplified PAMT kb counting. (Dave)
>  - tdmrs_count_pamt_pages() -> tdmr_count_pamt_kb() (Kirill/Dave)
>
> v10 -> v11:
>  - No update
>
> v9 -> v10:
>  - Removed code change in disable_tdx_module() as it doesn't exist
>    anymore.
>
> v8 -> v9:
>  - Added TDX_PS_NR macro instead of open-coding (Dave).
>  - Better alignment of 'pamt_entry_size' in tdmr_set_up_pamt() (Dave).
>  - Changed to print out PAMTs in "KBs" instead of "pages" (Dave).
>  - Added Dave's Reviewed-by.
>
> v7 -> v8: (Dave)
>  - Changelog:
>   - Added a sentence to state PAMT allocation will be improved.
>   - Others suggested by Dave.
>  - Moved 'nid' of 'struct tdx_memblock' to this patch.
>  - Improved comments around tdmr_get_nid().
>  - WARN_ON_ONCE() -> pr_warn() in tdmr_get_nid().
>  - Other changes due to 'struct tdmr_info_list'.
>
> v6 -> v7:
>  - Changes due to using macros instead of 'enum' for TDX supported page
>    sizes.
>
> v5 -> v6:
>  - Rebase due to using 'tdx_memblock' instead of memblock.
>  - 'int pamt_entry_nr' -> 'unsigned long nr_pamt_entries' (Dave/Sagis).
>  - Improved comment around tdmr_get_nid() (Dave).
>  - Improved comment in tdmr_set_up_pamt() around breaking the PAMT
>    into PAMTs for 4K/2M/1G (Dave).
>  - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave).
>
> - v3 -> v5 (no feedback on v4):
>  - Used memblock to get the NUMA node for given TDMR.
>  - Removed tdmr_get_pamt_sz() helper but use open-code instead.
>  - Changed to use 'switch .. case..' for each TDX supported page size in
>    tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()).
>  - Added printing out memory used for PAMT allocation when TDX module is
>    initialized successfully.
>  - Explained downside of alloc_contig_pages() in changelog.
>  - Addressed other minor comments.
>
>
> ---
>  arch/x86/Kconfig            |   1 +
>  arch/x86/include/asm/tdx.h  |   1 +
>  arch/x86/virt/vmx/tdx/tdx.c | 215 +++++++++++++++++++++++++++++++++++-
>  arch/x86/virt/vmx/tdx/tdx.h |   1 +
>  4 files changed, 213 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2226d8a4c749..ad364f01de33 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
>  	depends on KVM_INTEL
>  	depends on X86_X2APIC
>  	select ARCH_KEEP_MEMBLOCK
> +	depends on CONTIG_ALLOC
>  	help
>  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>  	  host and certain physical attacks.  This option enables necessary TDX
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index d8226a50c58c..91416fd600cd 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -24,6 +24,7 @@
>  #define TDX_PS_4K	0
>  #define TDX_PS_2M	1
>  #define TDX_PS_1G	2
> +#define TDX_PS_NR	(TDX_PS_1G + 1)
>
>  /*
>   * Used to gather the output registers values of the TDCALL and SEAMCALL
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 2ffc1517a93b..fd5417577f26 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -221,7 +221,7 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
>   * overlap.
>   */
>  static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> -			    unsigned long end_pfn)
> +			    unsigned long end_pfn, int nid)
>  {
>  	struct tdx_memblock *tmb;
>
> @@ -232,6 +232,7 @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
>  	INIT_LIST_HEAD(&tmb->list);
>  	tmb->start_pfn = start_pfn;
>  	tmb->end_pfn = end_pfn;
> +	tmb->nid = nid;
>
>  	/* @tmb_list is protected by mem_hotplug_lock */
>  	list_add_tail(&tmb->list, tmb_list);
> @@ -259,9 +260,9 @@ static void free_tdx_memlist(struct list_head *tmb_list)
>  static int build_tdx_memlist(struct list_head *tmb_list)
>  {
>  	unsigned long start_pfn, end_pfn;
> -	int i, ret;
> +	int i, nid, ret;
>
> -	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
>  		/*
>  		 * The first 1MB is not reported as TDX convertible memory.
>  		 * Although the first 1MB is always reserved and won't end up
> @@ -277,7 +278,7 @@ static int build_tdx_memlist(struct list_head *tmb_list)
>  		 * memblock has already guaranteed they are in address
>  		 * ascending order and don't overlap.
>  		 */
> -		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> +		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid);
>  		if (ret)
>  			goto err;
>  	}
> @@ -437,6 +438,202 @@ static int fill_out_tdmrs(struct list_head *tmb_list,
>  	return 0;
>  }
>
> +/*
> + * Calculate PAMT size given a TDMR and a page size.  The returned
> + * PAMT size is always aligned up to 4K page boundary.
> + */
> +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz,
> +				      u16 pamt_entry_size)
> +{
> +	unsigned long pamt_sz, nr_pamt_entries;
> +
> +	switch (pgsz) {
> +	case TDX_PS_4K:
> +		nr_pamt_entries = tdmr->size >> PAGE_SHIFT;
> +		break;
> +	case TDX_PS_2M:
> +		nr_pamt_entries = tdmr->size >> PMD_SHIFT;
> +		break;
> +	case TDX_PS_1G:
> +		nr_pamt_entries = tdmr->size >> PUD_SHIFT;
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		return 0;
> +	}
> +
> +	pamt_sz = nr_pamt_entries * pamt_entry_size;
> +	/* TDX requires the PAMT size to be 4K aligned */
> +	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
> +
> +	return pamt_sz;
> +}
> +
> +/*
> + * Locate a NUMA node which should hold the allocation of the @tdmr
> + * PAMT.  This node will have some memory covered by the TDMR.  The
> + * relative amount of memory covered is not considered.
> + */
> +static int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_list)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	/*
> +	 * A TDMR must cover at least part of one TMB.  That TMB will end
> +	 * after the TDMR begins.  But, that TMB may have started before
> +	 * the TDMR.  Find the next 'tmb' that _ends_ after this TDMR
> +	 * begins.  Ignore 'tmb' start addresses.  They are irrelevant.
> +	 */
> +	list_for_each_entry(tmb, tmb_list, list) {
> +		if (tmb->end_pfn > PHYS_PFN(tdmr->base))
> +			return tmb->nid;
> +	}
> +
> +	/*
> +	 * Fall back to allocating the TDMR's metadata from node 0 when
> +	 * no TDX memory block can be found.  This should never happen
> +	 * since TDMRs originate from TDX memory blocks.
> +	 */
> +	pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT allocation, fallback to use node 0.\n",
> +			tdmr->base, tdmr_end(tdmr));
> +	return 0;
> +}
> +
> +/*
> + * Allocate PAMTs from the local NUMA node of some memory in @tmb_list
> + * within @tdmr, and set up PAMTs for @tdmr.
> + */
> +static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> +			    struct list_head *tmb_list,
> +			    u16 pamt_entry_size)
> +{
> +	unsigned long pamt_base[TDX_PS_NR];
> +	unsigned long pamt_size[TDX_PS_NR];
> +	unsigned long tdmr_pamt_base;
> +	unsigned long tdmr_pamt_size;
> +	struct page *pamt;
> +	int pgsz, nid;
> +
> +	nid = tdmr_get_nid(tdmr, tmb_list);
> +
> +	/*
> +	 * Calculate the PAMT size for each TDX supported page size
> +	 * and the total PAMT size.
> +	 */
> +	tdmr_pamt_size = 0;
> +	for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR ; pgsz++) {
                                           ^
Please remove the additional space.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>

> +		pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz,
> +					pamt_entry_size);
> +		tdmr_pamt_size += pamt_size[pgsz];
> +	}
> +
> +	/*
> +	 * Allocate one chunk of physically contiguous memory for all
> +	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
> +	 * in overlapped TDMRs.
> +	 */
> +	pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL,
> +			nid, &node_online_map);
> +	if (!pamt)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Break the contiguous allocation back up into the
> +	 * individual PAMTs for each page size.
> +	 */
> +	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
> +	for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) {
> +		pamt_base[pgsz] = tdmr_pamt_base;
> +		tdmr_pamt_base += pamt_size[pgsz];
> +	}
> +
> +	tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> +	tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
> +	tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
> +	tdmr->pamt_2m_size = pamt_size[TDX_PS_2M];
> +	tdmr->pamt_1g_base = pamt_base[TDX_PS_1G];
> +	tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
> +
> +	return 0;
> +}
> +
> +static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_base,
> +			  unsigned long *pamt_size)
> +{
> +	unsigned long pamt_bs, pamt_sz;
> +
> +	/*
> +	 * The PAMT was allocated in one contiguous unit.  The 4K PAMT
> +	 * should always point to the beginning of that allocation.
> +	 */
> +	pamt_bs = tdmr->pamt_4k_base;
> +	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
> +
> +	WARN_ON_ONCE((pamt_bs & ~PAGE_MASK) || (pamt_sz & ~PAGE_MASK));
> +
> +	*pamt_base = pamt_bs;
> +	*pamt_size = pamt_sz;
> +}
> +
> +static void tdmr_free_pamt(struct tdmr_info *tdmr)
> +{
> +	unsigned long pamt_base, pamt_size;
> +
> +	tdmr_get_pamt(tdmr, &pamt_base, &pamt_size);
> +
> +	/* Do nothing if PAMT hasn't been allocated for this TDMR */
> +	if (!pamt_size)
> +		return;
> +
> +	if (WARN_ON_ONCE(!pamt_base))
> +		return;
> +
> +	free_contig_range(pamt_base >> PAGE_SHIFT, pamt_size >> PAGE_SHIFT);
> +}
> +
> +static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list)
> +{
> +	int i;
> +
> +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
> +		tdmr_free_pamt(tdmr_entry(tdmr_list, i));
> +}
> +
> +/* Allocate and set up PAMTs for all TDMRs */
> +static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
> +				 struct list_head *tmb_list,
> +				 u16 pamt_entry_size)
> +{
> +	int i, ret = 0;
> +
> +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> +		ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list,
> +				pamt_entry_size);
> +		if (ret)
> +			goto err;
> +	}
> +
> +	return 0;
> +err:
> +	tdmrs_free_pamt_all(tdmr_list);
> +	return ret;
> +}
> +
> +static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list)
> +{
> +	unsigned long pamt_size = 0;
> +	int i;
> +
> +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> +		unsigned long base, size;
> +
> +		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
> +		pamt_size += size;
> +	}
> +
> +	return pamt_size / 1024;
> +}
> +
>  /*
>   * Construct a list of TDMRs on the preallocated space in @tdmr_list
>   * to cover all TDX memory regions in @tmb_list based on the TDX module
> @@ -452,10 +649,13 @@ static int construct_tdmrs(struct list_head *tmb_list,
>  	if (ret)
>  		return ret;
>
> +	ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list,
> +			sysinfo->pamt_entry_size);
> +	if (ret)
> +		return ret;
>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Allocate and set up PAMTs for each TDMR.
>  	 *  - Designate reserved areas for each TDMR.
>  	 *
>  	 * Return -EINVAL until constructing TDMRs is done
> @@ -526,6 +726,11 @@ static int init_tdx_module(void)
>  	 *  Return error before all steps are done.
>  	 */
>  	ret = -EINVAL;
> +	if (ret)
> +		tdmrs_free_pamt_all(&tdmr_list);
> +	else
> +		pr_info("%lu KBs allocated for PAMT.\n",
> +				tdmrs_count_pamt_kb(&tdmr_list));
>  out_free_tdmrs:
>  	/*
>  	 * Always free the buffer of TDMRs as they are only used during
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 3086f7ad0522..9b5a65f37e8b 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -121,6 +121,7 @@ struct tdx_memblock {
>  	struct list_head list;
>  	unsigned long start_pfn;
>  	unsigned long end_pfn;
> +	int nid;
>  };
>
>  /* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2023-07-04  7:40   ` Yuan Yao
@ 2023-07-04  8:59     ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-07-04  8:59 UTC (permalink / raw)
  To: yuan.yao
  Cc: kvm, Raj, Ashok, Hansen, Dave, david, bagasdotme, Luck, Tony, ak,
	Wysocki, Rafael J, linux-kernel, Christopherson,,
	Sean, mingo, pbonzini, linux-mm, tglx, kirill.shutemov, Chatre,
	Reinette, Yamahata, Isaku, nik.borisov, hpa, peterz, Shahar,
	Sagi, imammedo, bp, Gao, Chao, Brown, Len,
	sathyanarayanan.kuppuswamy, Huang, Ying, Williams, Dan J, x86

On Tue, 2023-07-04 at 15:40 +0800, Yuan Yao wrote:
> > +	for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR ; pgsz++) {
>                                            ^
> Please remove the additional space.
> 
> Reviewed-by: Yuan Yao <yuan.yao@intel.com>

Appreciate!

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-06-30 21:24               ` Sean Christopherson
                                   ` (2 preceding siblings ...)
  2023-07-03 10:49                 ` Peter Zijlstra
@ 2023-07-04 16:58                 ` Peter Zijlstra
  2023-07-04 21:50                   ` Huang, Kai
  2023-07-05 14:34                   ` Dave Hansen
  3 siblings, 2 replies; 159+ messages in thread
From: Peter Zijlstra @ 2023-07-04 16:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Kai Huang, kvm, Ashok Raj, Tony Luck, david,
	bagasdotme, Dave Hansen, ak, Rafael J Wysocki, kirill.shutemov,
	Reinette Chatre, pbonzini, mingo, tglx, linux-kernel, linux-mm,
	Isaku Yamahata, nik.borisov, hpa, Sagi Shahar, imammedo, bp,
	Chao Gao, Len Brown, sathyanarayanan.kuppuswamy, Ying Huang,
	Dan J Williams, x86

On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote:

> Waiting until userspace attempts to create the first TDX guest adds complexity
> and limits what KVM can do to harden itself.  Currently, all feature support in
> KVM is effectively frozen at module load.  E.g. most of the setup code is
> contained in __init functions, many module-scoped variables are effectively 
> RO after init (though they can't be marked as such until we smush kvm-intel.ko
> and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan).  All of
> those patterns would get tossed aside if KVM waits until userspace attempts to
> create the first guest.

....

People got poked and the following was suggested:

On boot do:

 TDH.SYS.INIT
 TDH.SYS.LP.INIT
 TDH.SYS.CONFIG
 TDH.SYS.KEY.CONFIG

This should get TDX mostly sorted, but doesn't consume many resources.
Then later, when starting the first TDX guest, do the whole

 TDH.TDMR.INIT

dance to set up the PAMT array -- which is what gobbles up memory. From
what I understand the TDH.TDMR.INIT thing is not one of those
excessively long calls.

If we have concerns about allocating the PAMT array, can't we use CMA
for this? Allocate the whole thing at boot as CMA such that when not
used for TDX it can be used for regular things like userspace and
filecache pages?
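
Roughly, with the generic CMA API (the pool size heuristic and the TDX
hook-up below are hand-waved assumptions):

	#include <linux/cma.h>
	#include <linux/memblock.h>

	static struct cma *tdx_pamt_cma;

	/*
	 * Sketch: carve out a boot-time CMA pool sized for the
	 * worst-case PAMT (~1/256th of memory for 16-byte PAMT
	 * entries).  Must run early, e.g. from setup_arch().
	 */
	void __init tdx_reserve_pamt_cma(void)
	{
		phys_addr_t size = ALIGN(memblock_phys_mem_size() / 256, SZ_2M);

		if (cma_declare_contiguous(0, size, 0, SZ_2M, 0, false,
					   "tdx_pamt", &tdx_pamt_cma))
			pr_warn("tdx: failed to reserve PAMT CMA\n");
	}

	/*
	 * PAMT allocations would then come from the pool instead of
	 * alloc_contig_pages(); while unused for TDX, the pool backs
	 * movable allocations such as page cache and user pages.
	 */
	static struct page *tdx_alloc_pamt(unsigned long npages)
	{
		return cma_alloc(tdx_pamt_cma, npages, 0, true);
	}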

Those TDH.SYS calls should be enough to ensure TDX is actually working,
no?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-07-04 16:58                 ` Peter Zijlstra
@ 2023-07-04 21:50                   ` Huang, Kai
  2023-07-05  7:16                     ` Peter Zijlstra
  2023-07-05 14:34                   ` Dave Hansen
  1 sibling, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-07-04 21:50 UTC (permalink / raw)
  To: peterz, Christopherson,, Sean
  Cc: kvm, x86, Raj, Ashok, Hansen, Dave, david, bagasdotme, ak,
	Wysocki, Rafael J, linux-kernel, Chatre, Reinette, mingo,
	kirill.shutemov, tglx, linux-mm, pbonzini, nik.borisov, Yamahata,
	Isaku, Luck, Tony, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	isaku.yamahata, Brown, Len, sathyanarayanan.kuppuswamy, Huang,
	Ying, Williams, Dan J

On Tue, 2023-07-04 at 18:58 +0200, Peter Zijlstra wrote:
> On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote:
> 
> > Waiting until userspace attempts to create the first TDX guest adds complexity
> > and limits what KVM can do to harden itself.  Currently, all feature support in
> > KVM is effectively frozen at module load.  E.g. most of the setup code is
> > contained in __init functions, many module-scoped variables are effectively 
> > RO after init (though they can't be marked as such until we smush kvm-intel.ko
> > and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan).  All of
> > those patterns would get tossed aside if KVM waits until userspace attempts to
> > create the first guest.
> 
> ....
> 
> People got poked and the following was suggested:
> 
> On boot do:
> 
>  TDH.SYS.INIT
>  TDH.SYS.LP.INIT
>  TDH.SYS.CONFIG
>  TDH.SYS.KEY.CONFIG
> 
> This should get TDX mostly sorted, but doesn't consume many resources.
> Then later, when starting the first TDX guest, do the whole
> 
>  TDH.TDMR.INIT
> 
> dance to set up the PAMT array -- which is what gobbles up memory. From
> what I understand the TDH.TDMR.INIT thing is not one of those
> excessively long calls.

The TDH.TDMR.INIT itself has its own latency requirement implemented in the TDX
module, thus it only initializes a small chunk (1M, I guess) in each call.
Therefore we need a loop of TDH.TDMR.INIT calls in order to initialize all
PAMT entries for all TDX-usable memory, which can be time-consuming.

Currently for simplicity we just do this inside the module initialization, but
it can be optimized later once we have agreed on how to optimize it.
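
The shape of that loop, roughly (a sketch only; TDH_TDMR_INIT's
"next-to-initialize in RDX" convention and the helper names are my
assumptions, not the documented ABI):

	static int init_tdmr(struct tdmr_info *tdmr)
	{
		u64 next = tdmr->base;

		/*
		 * Each TDH.TDMR.INIT initializes a bounded chunk of the
		 * TDMR's PAMT entries and reports the next address still
		 * to be initialized, so the kernel can reschedule
		 * between chunks instead of hogging the CPU.
		 */
		do {
			struct tdx_module_output out;
			int ret;

			ret = seamcall(TDH_TDMR_INIT, tdmr->base, 0, 0, 0,
				       NULL, &out);
			if (ret)
				return ret;

			next = out.rdx;
			cond_resched();
		} while (next < tdmr_end(tdmr));

		return 0;
	}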

> 
> If we have concerns about allocating the PAMT array, can't we use CMA
> for this? Allocate the whole thing at boot as CMA such that when not
> used for TDX it can be used for regular things like userspace and
> filecache pages?

The PAMT allocation itself isn't a concern, I think.  The concern is the
TDH.TDMR.INIT calls needed to initialize them.

Also, one practical problem preventing us from pre-allocating the PAMT is that
the PAMT size to be allocated can only be determined after the TDH.SYS.INFO
SEAMCALL, which reports the "PAMT entry size" in the TDSYSINFO_STRUCT.
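
For reference, once that entry size is known the sizing is simple
arithmetic (a sketch mirroring tdmr_get_pamt_sz() from patch 12; the
16-byte entry size mentioned below is an assumption):

	/*
	 * Back-of-envelope total PAMT size for one TDMR.  With a
	 * 16-byte entry, the 4K level alone costs tdmr_size / 4096 * 16,
	 * i.e. ~1/256th of the TDMR; the 2M and 1G levels add
	 * comparatively little on top.
	 */
	static u64 pamt_total_size(u64 tdmr_size, u16 pamt_entry_size)
	{
		u64 nr_entries = (tdmr_size >> 12) +	/* 4K pages */
				 (tdmr_size >> 21) +	/* 2M pages */
				 (tdmr_size >> 30);	/* 1G pages */

		return nr_entries * pamt_entry_size;
	}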

> 
> Those TDH.SYS calls should be enough to ensure TDX is actually working,
> no?


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 13/22] x86/virt/tdx: Designate reserved areas for all TDMRs
  2023-06-26 14:12 ` [PATCH v12 13/22] x86/virt/tdx: Designate reserved areas for all TDMRs Kai Huang
@ 2023-07-05  5:29   ` Yuan Yao
  0 siblings, 0 replies; 159+ messages in thread
From: Yuan Yao @ 2023-07-05  5:29 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, peterz, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:43AM +1200, Kai Huang wrote:
> As the last step of constructing TDMRs, populate reserved areas for all
> TDMRs.  For each TDMR, put all memory holes within this TDMR to the
> reserved areas.  And for all PAMTs which overlap with this TDMR, put
> all the overlapping parts to reserved areas too.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>

>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>
> v11 -> v12:
>  - Code change due to tdmr_get_pamt() change from returning pfn/npages to
>    base/size
>  - Added Kirill's tag
>
> v10 -> v11:
>  - No update
>
> v9 -> v10:
>  - No change.
>
> v8 -> v9:
>  - Added comment around 'tdmr_add_rsvd_area()' to point out it doesn't do
>    optimization to save reserved areas. (Dave).
>
> v7 -> v8: (Dave)
>  - "set_up" -> "populate" in function name change (Dave).
>  - Improved comment suggested by Dave.
>  - Other changes due to 'struct tdmr_info_list'.
>
> v6 -> v7:
>  - No change.
>
> v5 -> v6:
>  - Rebase due to using 'tdx_memblock' instead of memblock.
>  - Split tdmr_set_up_rsvd_areas() into two functions to handle memory
>    hole and PAMT respectively.
>  - Added Isaku's Reviewed-by.
>
>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 217 ++++++++++++++++++++++++++++++++++--
>  1 file changed, 209 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index fd5417577f26..2bcace5cb25c 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -25,6 +25,7 @@
>  #include <linux/sizes.h>
>  #include <linux/pfn.h>
>  #include <linux/align.h>
> +#include <linux/sort.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/archrandom.h>
> @@ -634,6 +635,207 @@ static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list)
>  	return pamt_size / 1024;
>  }
>
> +static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
> +			      u64 size, u16 max_reserved_per_tdmr)
> +{
> +	struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
> +	int idx = *p_idx;
> +
> +	/* Reserved area must be 4K aligned in offset and size */
> +	if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
> +		return -EINVAL;
> +
> +	if (idx >= max_reserved_per_tdmr) {
> +		pr_warn("initialization failed: TDMR [0x%llx, 0x%llx): reserved areas exhausted.\n",
> +				tdmr->base, tdmr_end(tdmr));
> +		return -ENOSPC;
> +	}
> +
> +	/*
> +	 * Consume one reserved area per call.  Make no effort to
> +	 * optimize or reduce the number of reserved areas which are
> +	 * consumed by contiguous reserved areas, for instance.
> +	 */
> +	rsvd_areas[idx].offset = addr - tdmr->base;
> +	rsvd_areas[idx].size = size;
> +
> +	*p_idx = idx + 1;
> +
> +	return 0;
> +}
> +
> +/*
> + * Go through @tmb_list to find holes between memory areas.  If any of
> + * those holes fall within @tdmr, set up a TDMR reserved area to cover
> + * the hole.
> + */
> +static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
> +				    struct tdmr_info *tdmr,
> +				    int *rsvd_idx,
> +				    u16 max_reserved_per_tdmr)
> +{
> +	struct tdx_memblock *tmb;
> +	u64 prev_end;
> +	int ret;
> +
> +	/*
> +	 * Start looking for reserved blocks at the
> +	 * beginning of the TDMR.
> +	 */
> +	prev_end = tdmr->base;
> +	list_for_each_entry(tmb, tmb_list, list) {
> +		u64 start, end;
> +
> +		start = PFN_PHYS(tmb->start_pfn);
> +		end   = PFN_PHYS(tmb->end_pfn);
> +
> +		/* Break if this region is after the TDMR */
> +		if (start >= tdmr_end(tdmr))
> +			break;
> +
> +		/* Exclude regions before this TDMR */
> +		if (end < tdmr->base)
> +			continue;
> +
> +		/*
> +		 * Skip over memory areas that
> +		 * have already been dealt with.
> +		 */
> +		if (start <= prev_end) {
> +			prev_end = end;
> +			continue;
> +		}
> +
> +		/* Add the hole before this region */
> +		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
> +				start - prev_end,
> +				max_reserved_per_tdmr);
> +		if (ret)
> +			return ret;
> +
> +		prev_end = end;
> +	}
> +
> +	/* Add the hole after the last region if it exists. */
> +	if (prev_end < tdmr_end(tdmr)) {
> +		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end,
> +				tdmr_end(tdmr) - prev_end,
> +				max_reserved_per_tdmr);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Go through @tdmr_list to find all PAMTs.  If any of those PAMTs
> + * overlaps with @tdmr, set up a TDMR reserved area to cover the
> + * overlapping part.
> + */
> +static int tdmr_populate_rsvd_pamts(struct tdmr_info_list *tdmr_list,
> +				    struct tdmr_info *tdmr,
> +				    int *rsvd_idx,
> +				    u16 max_reserved_per_tdmr)
> +{
> +	int i, ret;
> +
> +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> +		struct tdmr_info *tmp = tdmr_entry(tdmr_list, i);
> +		unsigned long pamt_base, pamt_size, pamt_end;
> +
> +		tdmr_get_pamt(tmp, &pamt_base, &pamt_size);
> +		/* Each TDMR must already have PAMT allocated */
> +		WARN_ON_ONCE(!pamt_size || !pamt_base);
> +
> +		pamt_end = pamt_base + pamt_size;
> +		/* Skip PAMTs outside of the given TDMR */
> +		if ((pamt_end <= tdmr->base) ||
> +				(pamt_base >= tdmr_end(tdmr)))
> +			continue;
> +
> +		/* Only mark the part within the TDMR as reserved */
> +		if (pamt_base < tdmr->base)
> +			pamt_base = tdmr->base;
> +		if (pamt_end > tdmr_end(tdmr))
> +			pamt_end = tdmr_end(tdmr);
> +
> +		ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_base,
> +				pamt_end - pamt_base,
> +				max_reserved_per_tdmr);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +/* Compare function called by sort() for TDMR reserved areas */
> +static int rsvd_area_cmp_func(const void *a, const void *b)
> +{
> +	struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a;
> +	struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b;
> +
> +	if (r1->offset + r1->size <= r2->offset)
> +		return -1;
> +	if (r1->offset >= r2->offset + r2->size)
> +		return 1;
> +
> +	/* Reserved areas cannot overlap.  The caller must guarantee. */
> +	WARN_ON_ONCE(1);
> +	return -1;
> +}
> +
> +/*
> + * Populate reserved areas for the given @tdmr, including memory holes
> + * (via @tmb_list) and PAMTs (via @tdmr_list).
> + */
> +static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
> +				    struct list_head *tmb_list,
> +				    struct tdmr_info_list *tdmr_list,
> +				    u16 max_reserved_per_tdmr)
> +{
> +	int ret, rsvd_idx = 0;
> +
> +	ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx,
> +			max_reserved_per_tdmr);
> +	if (ret)
> +		return ret;
> +
> +	ret = tdmr_populate_rsvd_pamts(tdmr_list, tdmr, &rsvd_idx,
> +			max_reserved_per_tdmr);
> +	if (ret)
> +		return ret;
> +
> +	/* TDX requires reserved areas listed in address ascending order */
> +	sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area),
> +			rsvd_area_cmp_func, NULL);
> +
> +	return 0;
> +}
> +
> +/*
> + * Populate reserved areas for all TDMRs in @tdmr_list, including memory
> + * holes (via @tmb_list) and PAMTs.
> + */
> +static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
> +					 struct list_head *tmb_list,
> +					 u16 max_reserved_per_tdmr)
> +{
> +	int i;
> +
> +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> +		int ret;
> +
> +		ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i),
> +				tmb_list, tdmr_list, max_reserved_per_tdmr);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * Construct a list of TDMRs on the preallocated space in @tdmr_list
>   * to cover all TDX memory regions in @tmb_list based on the TDX module
> @@ -653,14 +855,13 @@ static int construct_tdmrs(struct list_head *tmb_list,
>  			sysinfo->pamt_entry_size);
>  	if (ret)
>  		return ret;
> -	/*
> -	 * TODO:
> -	 *
> -	 *  - Designate reserved areas for each TDMR.
> -	 *
> -	 * Return -EINVAL until constructing TDMRs is done
> -	 */
> -	return -EINVAL;
> +
> +	ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
> +			sysinfo->max_reserved_per_tdmr);
> +	if (ret)
> +		tdmrs_free_pamt_all(tdmr_list);
> +
> +	return ret;
>  }
>
>  static int init_tdx_module(void)
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 14/22] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
  2023-06-26 14:12 ` [PATCH v12 14/22] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID Kai Huang
@ 2023-07-05  6:49   ` Yuan Yao
  0 siblings, 0 replies; 159+ messages in thread
From: Yuan Yao @ 2023-07-05  6:49 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, peterz, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:44AM +1200, Kai Huang wrote:
> The TDX module uses a private KeyID as the "global KeyID" for mapping
> things like the PAMT and other TDX metadata.  This KeyID has already
> been reserved when detecting TDX during the kernel early boot.
>
> After the list of "TD Memory Regions" (TDMRs) has been constructed to
> cover all TDX-usable memory regions, the next step is to pass them to
> the TDX module together with the global KeyID.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>

>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>
> v11 -> v12:
>  - Added Kirill's tag
>
> v10 -> v11:
>  - No update
>
> v9 -> v10:
>  - Code change due to change static 'tdx_tdmr_list' to local 'tdmr_list'.
>
> v8 -> v9:
>  - Improved changlog to explain why initializing TDMRs can take long
>    time (Dave).
>  - Improved comments around 'next-to-initialize' address (Dave).
>
> v7 -> v8: (Dave)
>  - Changelog:
>    - explicitly call out this is the last step of TDX module initialization.
>    - Trimed down changelog by removing SEAMCALL name and details.
>  - Removed/trimmed down unnecessary comments.
>  - Other changes due to 'struct tdmr_info_list'.
>
> v6 -> v7:
>  - Removed need_resched() check. -- Andi.
>
>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 41 ++++++++++++++++++++++++++++++++++++-
>  arch/x86/virt/vmx/tdx/tdx.h |  2 ++
>  2 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 2bcace5cb25c..1992245290de 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -26,6 +26,7 @@
>  #include <linux/pfn.h>
>  #include <linux/align.h>
>  #include <linux/sort.h>
> +#include <linux/log2.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/archrandom.h>
> @@ -864,6 +865,39 @@ static int construct_tdmrs(struct list_head *tmb_list,
>  	return ret;
>  }
>
> +static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
> +{
> +	u64 *tdmr_pa_array;
> +	size_t array_sz;
> +	int i, ret;
> +
> +	/*
> +	 * TDMRs are passed to the TDX module via an array of physical
> +	 * addresses of each TDMR.  The array itself also has certain
> +	 * alignment requirement.
> +	 */
> +	array_sz = tdmr_list->nr_consumed_tdmrs * sizeof(u64);
> +	array_sz = roundup_pow_of_two(array_sz);
> +	if (array_sz < TDMR_INFO_PA_ARRAY_ALIGNMENT)
> +		array_sz = TDMR_INFO_PA_ARRAY_ALIGNMENT;
> +
> +	tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
> +	if (!tdmr_pa_array)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
> +		tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i));
> +
> +	ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array),
> +				tdmr_list->nr_consumed_tdmrs,
> +				global_keyid, 0, NULL, NULL);
> +
> +	/* Free the array as it is not required anymore. */
> +	kfree(tdmr_pa_array);
> +
> +	return ret;
> +}
> +
>  static int init_tdx_module(void)
>  {
>  	struct tdsysinfo_struct *sysinfo;
> @@ -917,16 +951,21 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out_free_tdmrs;
>
> +	/* Pass the TDMRs and the global KeyID to the TDX module */
> +	ret = config_tdx_module(&tdmr_list, tdx_global_keyid);
> +	if (ret)
> +		goto out_free_pamts;
> +
>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Configure the TDMRs and the global KeyID to the TDX module.
>  	 *  - Configure the global KeyID on all packages.
>  	 *  - Initialize all TDMRs.
>  	 *
>  	 *  Return error before all steps are done.
>  	 */
>  	ret = -EINVAL;
> +out_free_pamts:
>  	if (ret)
>  		tdmrs_free_pamt_all(&tdmr_list);
>  	else
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 9b5a65f37e8b..c386aa3afe2a 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -24,6 +24,7 @@
>  #define TDH_SYS_INFO		32
>  #define TDH_SYS_INIT		33
>  #define TDH_SYS_LP_INIT		35
> +#define TDH_SYS_CONFIG		45
>
>  struct cmr_info {
>  	u64	base;
> @@ -88,6 +89,7 @@ struct tdmr_reserved_area {
>  } __packed;
>
>  #define TDMR_INFO_ALIGNMENT	512
> +#define TDMR_INFO_PA_ARRAY_ALIGNMENT	512
>
>  struct tdmr_info {
>  	u64 base;
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-07-03 17:55                       ` kirill.shutemov
  2023-07-03 18:26                         ` Dave Hansen
@ 2023-07-05  7:14                         ` Peter Zijlstra
  1 sibling, 0 replies; 159+ messages in thread
From: Peter Zijlstra @ 2023-07-05  7:14 UTC (permalink / raw)
  To: kirill.shutemov
  Cc: Dave Hansen, Sean Christopherson, Isaku Yamahata, Kai Huang, kvm,
	Ashok Raj, Tony Luck, david, bagasdotme, ak, Rafael J Wysocki,
	Reinette Chatre, pbonzini, mingo, tglx, linux-kernel, linux-mm,
	Isaku Yamahata, nik.borisov, hpa, Sagi Shahar, imammedo, bp,
	Chao Gao, Len Brown, sathyanarayanan.kuppuswamy, Ying Huang,
	Dan J Williams, x86

On Mon, Jul 03, 2023 at 08:55:56PM +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Jul 03, 2023 at 05:03:30PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 03, 2023 at 07:40:55AM -0700, Dave Hansen wrote:
> > > On 7/3/23 03:49, Peter Zijlstra wrote:
> > > >> There are also latency and noisy neighbor concerns, e.g. we *really* don't want
> > > >> to end up in a situation where creating a TDX guest for a customer can observe
> > > >> arbitrary latency *and* potentially be disruptive to VMs already running on the
> > > >> host.
> > > > Well, that's a quality of implementation issue with the whole TDX
> > > > crapola. Sounds like we want to impose latency constraints on the
> > > > various TDX calls. Allowing it to consume arbitrary amounts of CPU time
> > > > is unacceptable in any case.
> > > 
> > > For what it's worth, everybody knew that calling into the TDX module was
> > > going to be a black hole and that consuming large amounts of CPU at
> > > random times would drive people bat guano crazy.
> > > 
> > > The TDX Module ABI spec does have "Leaf Function Latency" warnings for
> > > some of the module calls.  But, it's basically a binary thing.  A call
> > > is either normal or "longer than most".
> > > 
> > > The majority of the "longer than most" cases are for initialization.
> > > The _most_ obscene runtime ones are chunked up and can return partial
> > > progress to limit latency spikes.  But I don't think folks tried as hard
> > > on the initialization calls since they're only called once, which
> > > actually seems pretty reasonable to me.
> > > 
> > > Maybe we need three classes of "Leaf Function Latency":
> > > 1. Sane
> > > 2. "Longer than most"
> > > 3. Better turn the NMI watchdog off before calling this. :)
> > > 
> > > Would that help?
> > 
> > I'm thinking we want something along the lines of the Xen preemptible
> > hypercalls, except less crazy, where the caller does:
> > 
> > 	for (;;) {
> > 		ret = tdcall(fn, args);
> > 		if (ret == -EAGAIN) {
> > 			cond_resched();
> > 			continue;
> > 		}
> > 		break;
> > 	}
> > 
> > And then the TDX black box provides a guarantee that any one tdcall (or
> > seamcall or whatever) never takes more than X ns (possibly even
> > configurable) and we get to raise a bug report if we can prove it
> > actually takes longer.
> 
> TDG.VP.VMCALL TDCALL can take arbitrary amount of time as it handles over
> control to the host/VMM.
> 
> But I'm not quite follow how it is different from the host stopping
> scheduling vCPU on a random instruction. It can happen at any point and
> TDCALL is not special from this PoV.

A guest will exit on timer/interrupt and then the host can reschedule;
AFAIU this doesn't actually happen with these TDX calls: if control is
in that SEAM thing, it stays there until it's done.


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-07-04 21:50                   ` Huang, Kai
@ 2023-07-05  7:16                     ` Peter Zijlstra
  2023-07-05  7:54                       ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-07-05  7:16 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Christopherson,,
	Sean, kvm, x86, Raj, Ashok, Hansen, Dave, david, bagasdotme, ak,
	Wysocki, Rafael J, linux-kernel, Chatre, Reinette, mingo,
	kirill.shutemov, tglx, linux-mm, pbonzini, nik.borisov, Yamahata,
	Isaku, Luck, Tony, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	isaku.yamahata, Brown, Len, sathyanarayanan.kuppuswamy, Huang,
	Ying, Williams, Dan J

On Tue, Jul 04, 2023 at 09:50:22PM +0000, Huang, Kai wrote:
> On Tue, 2023-07-04 at 18:58 +0200, Peter Zijlstra wrote:
> > On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote:
> > 
> > > Waiting until userspace attempts to create the first TDX guest adds complexity
> > > and limits what KVM can do to harden itself.  Currently, all feature support in
> > > KVM is effectively frozen at module load.  E.g. most of the setup code is
> > > contained in __init functions, many module-scoped variables are effectively 
> > > RO after init (though they can't be marked as such until we smush kvm-intel.ko
> > > and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan).  All of
> > > those patterns would get tossed aside if KVM waits until userspace attempts to
> > > create the first guest.
> > 
> > ....
> > 
> > People got poked and the following was suggested:
> > 
> > On boot do:
> > 
> >  TDH.SYS.INIT
> >  TDH.SYS.LP.INIT
> >  TDH.SYS.CONFIG
> >  TDH.SYS.KEY.CONFIG
> > 
> > This should get TDX mostly sorted, but doesn't consume many resources.
> > Then later, when starting the first TDX guest, do the whole
> > 
> >  TDH.TDMR.INIT
> > 
> > dance to set up the PAMT array -- which is what gobbles up memory. From
> > what I understand the TDH.TDMR.INIT thing is not one of those
> > excessively long calls.
> 
> The TDH.TDMR.INIT itself has its own latency requirement implemented in the TDX
> module, thus it only initializes a small chunk (1M, I guess) in each call.
> Therefore we need a loop of TDH.TDMR.INIT calls in order to initialize all
> PAMT entries for all TDX-usable memory, which can be time-consuming.

Yeah, so you can put a cond_resched() in that loop and all is well, you
do not negatively affect other tasks. Because *that* was the concern
raised.

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-07-05  7:16                     ` Peter Zijlstra
@ 2023-07-05  7:54                       ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-07-05  7:54 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Brown, Len, Raj, Ashok, Huang, Ying, Hansen, Dave, david,
	bagasdotme, ak, Wysocki, Rafael J, linux-kernel, Chatre,
	Reinette, Christopherson,,
	Sean, pbonzini, tglx, linux-mm, kirill.shutemov, mingo, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Luck, Tony,
	isaku.yamahata, Gao, Chao, sathyanarayanan.kuppuswamy, x86,
	Williams, Dan J

On Wed, 2023-07-05 at 09:16 +0200, Peter Zijlstra wrote:
> On Tue, Jul 04, 2023 at 09:50:22PM +0000, Huang, Kai wrote:
> > On Tue, 2023-07-04 at 18:58 +0200, Peter Zijlstra wrote:
> > > On Fri, Jun 30, 2023 at 02:24:56PM -0700, Sean Christopherson wrote:
> > > 
> > > > Waiting until userspace attempts to create the first TDX guest adds complexity
> > > > and limits what KVM can do to harden itself.  Currently, all feature support in
> > > > KVM is effectively frozen at module load.  E.g. most of the setup code is
> > > > contained in __init functions, many module-scoped variables are effectively 
> > > > RO after init (though they can't be marked as such until we smush kvm-intel.ko
> > > > and kvm-amd.ko into kvm.ko, which is tentatively the long-term plan).  All of
> > > > those patterns would get tossed aside if KVM waits until userspace attempts to
> > > > create the first guest.
> > > 
> > > ....
> > > 
> > > People got poked and the following was suggested:
> > > 
> > > On boot do:
> > > 
> > >  TDH.SYS.INIT
> > >  TDH.SYS.LP.INIT
> > >  TDH.SYS.CONFIG
> > >  TDH.SYS.KEY.CONFIG
> > > 
> > > This should get TDX mostly sorted, but doesn't consume many resources.
> > > Then later, when starting the first TDX guest, do the whole
> > > 
> > >  TDH.TDMR.INIT
> > > 
> > > dance to set up the PAMT array -- which is what gobbles up memory. From
> > > what I understand the TDH.TDMR.INIT thing is not one of those
> > > excessively long calls.
> > 
> > The TDH.TDMR.INIT itself has its own latency requirement implemented in the TDX
> > module, thus it only initializes a small chunk (1M I guess) in each call.
> > Therefore we need a loop to do a bunch of TDH.TDMR.INIT calls in order to
> > initialize all PAMT entries for all TDX-usable memory, which can be time-consuming.
> 
> Yeah, so you can put a cond_resched() in that loop and all is well, you
> do not negatively affect other tasks. Because *that* was the concern
> raised.

Yes, cond_resched() has been done.  It's in patch 16 (x86/virt/tdx: Initialize
all TDMRs); see the excerpt below.
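
For reference, the core of that loop in patch 16 (quoted in full later in
this thread) looks like this, trimmed:

	do {
		struct tdx_module_output out;
		int ret;

		/* All 0's are unused parameters, they mean nothing. */
		ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0,
				NULL, &out);
		if (ret)
			return ret;
		/*
		 * RDX contains the 'next-to-initialize' address if
		 * TDH.SYS.TDMR.INIT did not fully complete.
		 */
		next = out.rdx;
		cond_resched();
		/* Keep making SEAMCALLs until the TDMR is done */
	} while (next < tdmr->base + tdmr->size);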

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 15/22] x86/virt/tdx: Configure global KeyID on all packages
  2023-06-26 14:12 ` [PATCH v12 15/22] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
@ 2023-07-05  8:13   ` Yuan Yao
  0 siblings, 0 replies; 159+ messages in thread
From: Yuan Yao @ 2023-07-05  8:13 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, peterz, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:45AM +1200, Kai Huang wrote:
> After the list of TDMRs and the global KeyID are configured to the TDX
> module, the kernel needs to configure the key of the global KeyID on all
> packages using TDH.SYS.KEY.CONFIG.
>
> This SEAMCALL cannot run parallel on different cpus.  Loop all online
> cpus and use smp_call_on_cpu() to call this SEAMCALL on the first cpu of
> each package.
>
> To keep things simple, this implementation takes no affirmative steps to
> online cpus to make sure there's at least one cpu for each package.  The
> callers (aka. KVM) can ensure success by ensuring sufficient CPUs are
> online for this to succeed.
>
> Intel hardware doesn't guarantee cache coherency across different
> KeyIDs.  The PAMTs are transitioning from being used by the kernel
> mapping (KeyId 0) to the TDX module's "global KeyID" mapping.
>
> This means that the kernel must flush any dirty KeyID-0 PAMT cachelines
> before the TDX module uses the global KeyID to access the PAMTs.
> Otherwise, if those dirty cachelines were written back, they would
> corrupt the TDX module's metadata.  Aside: This corruption would be
> detected by the memory integrity hardware on the next read of the memory
> with the global KeyID.  The result would likely be fatal to the system
> but would not impact TDX security.
>
> Following the TDX module specification, flush cache before configuring
> the global KeyID on all packages.  Given the PAMT size can be large
> (~1/256th of system RAM), just use WBINVD on all CPUs to flush.
>
> If TDH.SYS.KEY.CONFIG fails, the TDX module may already have used the
> global KeyID to write the PAMTs.  Therefore, use WBINVD to flush cache
> before returning the PAMTs back to the kernel.  Also convert all PAMTs
> back to normal by using MOVDIR64B as suggested by the TDX module spec,
> although on the platform without the "partial write machine check"
> erratum it's OK to leave PAMTs as is.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>
> v11 -> v12:
>  - Added Kirill's tag
>  - Improved changelog (Nikolay)
>
> v10 -> v11:
>  - Convert PAMTs back to normal when module initialization fails.
>  - Fixed an error in changelog
>
> v9 -> v10:
>  - Changed to use 'smp_call_on_cpu()' directly to do key configuration.
>
> v8 -> v9:
>  - Improved changelog (Dave).
>  - Improved comments to explain the function to configure global KeyID
>    "takes no affirmative action to online any cpu". (Dave).
>  - Improved other comments suggested by Dave.
>
> v7 -> v8: (Dave)
>  - Changelog changes:
>   - Point out this is the step of "multi-steps" of init_tdx_module().
>   - Removed MOVDIR64B part.
>   - Other changes due to removing TDH.SYS.SHUTDOWN and TDH.SYS.LP.INIT.
>  - Changed to loop over online cpus and use smp_call_function_single()
>    directly as the patch to shut down TDX module has been removed.
>  - Removed MOVDIR64B part in comment.
>
> v6 -> v7:
>  - Improved changelog and comment to explain why MOVDIR64B isn't used
>    when returning PAMTs back to the kernel.
>
>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 135 +++++++++++++++++++++++++++++++++++-
>  arch/x86/virt/vmx/tdx/tdx.h |   1 +
>  2 files changed, 134 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 1992245290de..f5d4dbc11aee 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -31,6 +31,7 @@
>  #include <asm/msr.h>
>  #include <asm/archrandom.h>
>  #include <asm/page.h>
> +#include <asm/special_insns.h>
>  #include <asm/tdx.h>
>  #include "tdx.h"
>
> @@ -577,7 +578,8 @@ static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_base,
>  	*pamt_size = pamt_sz;
>  }
>
> -static void tdmr_free_pamt(struct tdmr_info *tdmr)
> +static void tdmr_do_pamt_func(struct tdmr_info *tdmr,
> +		void (*pamt_func)(unsigned long base, unsigned long size))
>  {
>  	unsigned long pamt_base, pamt_size;
>
> @@ -590,9 +592,19 @@ static void tdmr_free_pamt(struct tdmr_info *tdmr)
>  	if (WARN_ON_ONCE(!pamt_base))
>  		return;
>
> +	(*pamt_func)(pamt_base, pamt_size);
> +}
> +
> +static void free_pamt(unsigned long pamt_base, unsigned long pamt_size)
> +{
>  	free_contig_range(pamt_base >> PAGE_SHIFT, pamt_size >> PAGE_SHIFT);
>  }
>
> +static void tdmr_free_pamt(struct tdmr_info *tdmr)
> +{
> +	tdmr_do_pamt_func(tdmr, free_pamt);
> +}
> +
>  static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list)
>  {
>  	int i;
> @@ -621,6 +633,41 @@ static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
>  	return ret;
>  }
>
> +/*
> + * Convert TDX private pages back to normal by using MOVDIR64B to
> + * clear these pages.  Note this function doesn't flush cache of
> + * these TDX private pages.  The caller should make sure of that.
> + */
> +static void reset_tdx_pages(unsigned long base, unsigned long size)
> +{
> +	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
> +	unsigned long phys, end;
> +
> +	end = base + size;
> +	for (phys = base; phys < end; phys += 64)
> +		movdir64b(__va(phys), zero_page);

I worried about a write overflow at the beginning, but then I recalled that
the PAMT size is 4KB-aligned for 1G/2M/4K entries, thus:

Reviewed-by: Yuan Yao <yuan.yao@intel.com>

> +
> +	/*
> +	 * MOVDIR64B uses WC protocol.  Use memory barrier to
> +	 * make sure any later user of these pages sees the
> +	 * updated data.
> +	 */
> +	mb();
> +}
> +
> +static void tdmr_reset_pamt(struct tdmr_info *tdmr)
> +{
> +	tdmr_do_pamt_func(tdmr, reset_tdx_pages);
> +}
> +
> +static void tdmrs_reset_pamt_all(struct tdmr_info_list *tdmr_list)
> +{
> +	int i;
> +
> +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++)
> +		tdmr_reset_pamt(tdmr_entry(tdmr_list, i));
> +}
> +
>  static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list)
>  {
>  	unsigned long pamt_size = 0;
> @@ -898,6 +945,55 @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
>  	return ret;
>  }
>
> +static int do_global_key_config(void *data)
> +{
> +	/*
> +	 * TDH.SYS.KEY.CONFIG may fail with entropy error (which is a
> +	 * recoverable error).  Assume this is exceedingly rare and
> +	 * just return error if encountered instead of retrying.
> +	 *
> +	 * All '0's are just unused parameters.
> +	 */
> +	return seamcall(TDH_SYS_KEY_CONFIG, 0, 0, 0, 0, NULL, NULL);
> +}
> +
> +/*
> + * Attempt to configure the global KeyID on all physical packages.
> + *
> + * This requires running code on at least one CPU in each package.  If a
> + * package has no online CPUs, that code will not run and TDX module
> + * initialization (TDMR initialization) will fail.
> + *
> + * This code takes no affirmative steps to online CPUs.  Callers (aka.
> + * KVM) can ensure success by ensuring sufficient CPUs are online for
> + * this to succeed.
> + */
> +static int config_global_keyid(void)
> +{
> +	cpumask_var_t packages;
> +	int cpu, ret = -EINVAL;
> +
> +	if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	for_each_online_cpu(cpu) {
> +		if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
> +					packages))
> +			continue;
> +
> +		/*
> +		 * TDH.SYS.KEY.CONFIG cannot run concurrently on
> +		 * different cpus, so just do it one by one.
> +		 */
> +		ret = smp_call_on_cpu(cpu, do_global_key_config, NULL, true);
> +		if (ret)
> +			break;
> +	}
> +
> +	free_cpumask_var(packages);
> +	return ret;
> +}
> +
>  static int init_tdx_module(void)
>  {
>  	struct tdsysinfo_struct *sysinfo;
> @@ -956,15 +1052,47 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out_free_pamts;
>
> +	/*
> +	 * Hardware doesn't guarantee cache coherency across different
> +	 * KeyIDs.  The kernel needs to flush PAMT's dirty cachelines
> +	 * (associated with KeyID 0) before the TDX module can use the
> +	 * global KeyID to access the PAMT.  Given PAMTs are potentially
> +	 * large (~1/256th of system RAM), just use WBINVD on all cpus
> +	 * to flush the cache.
> +	 */
> +	wbinvd_on_all_cpus();
> +
> +	/* Config the key of global KeyID on all packages */
> +	ret = config_global_keyid();
> +	if (ret)
> +		goto out_reset_pamts;
> +
>  	/*
>  	 * TODO:
>  	 *
> -	 *  - Configure the global KeyID on all packages.
>  	 *  - Initialize all TDMRs.
>  	 *
>  	 *  Return error before all steps are done.
>  	 */
>  	ret = -EINVAL;
> +out_reset_pamts:
> +	if (ret) {
> +		/*
> +		 * Part of PAMTs may already have been initialized by the
> +		 * TDX module.  Flush cache before returning PAMTs back
> +		 * to the kernel.
> +		 */
> +		wbinvd_on_all_cpus();
> +		/*
> +		 * According to the TDX hardware spec, if the platform
> +		 * doesn't have the "partial write machine check"
> +		 * erratum, any kernel read/write will never cause #MC
> +		 * in kernel space, thus it's OK to not convert PAMTs
> +		 * back to normal.  But do the conversion anyway here
> +		 * as suggested by the TDX spec.
> +		 */
> +		tdmrs_reset_pamt_all(&tdmr_list);
> +	}
>  out_free_pamts:
>  	if (ret)
>  		tdmrs_free_pamt_all(&tdmr_list);
> @@ -1019,6 +1147,9 @@ static int __tdx_enable(void)
>   * lock to prevent any new cpu from becoming online; 2) done both VMXON
>   * and tdx_cpu_enable() on all online cpus.
>   *
> + * This function requires there's at least one online cpu for each CPU
> + * package to succeed.
> + *
>   * This function can be called in parallel by multiple callers.
>   *
>   * Return 0 if TDX is enabled successfully, otherwise error.
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index c386aa3afe2a..a0438513bec0 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -21,6 +21,7 @@
>  /*
>   * TDX module SEAMCALL leaf functions
>   */
> +#define TDH_SYS_KEY_CONFIG	31
>  #define TDH_SYS_INFO		32
>  #define TDH_SYS_INIT		33
>  #define TDH_SYS_LP_INIT		35
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-07-03 12:15               ` Huang, Kai
@ 2023-07-05 10:21                 ` Peter Zijlstra
  2023-07-05 11:34                   ` Huang, Kai
  0 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-07-05 10:21 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen, Dave, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, tglx, linux-kernel, linux-mm, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Mon, Jul 03, 2023 at 12:15:13PM +0000, Huang, Kai wrote:
> 
> > 
> > So I think the below deals with everything and unifies __tdx_hypercall()
> > and __tdx_module_call(), since both sides need to deal with exactly the
> > same trainwreck.
> 
> Hi Peter,
> 
> Just want to make sure I understand you correctly:
> 
> You want to make __tdx_module_call() look like __tdx_hypercall(), but not to
> unify them into one assembly (at least for now), right?

Well, given the horrendous trainwreck this is all turning into, I
thought it prudent to have it all in a single place. The moment you go
play games with callee-saved registers you're really close to what
hypercall does so then they might as well be the same.

> I am confused you mentioned VP.VMCALL below, which is handled by
> __tdx_hypercall().

But why? It really isn't *that* special if you consider the other calls
that are using callee-saved regs, yes it has the rdi/rsi extra, but meh,
it really just is tdcall-0.
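
In other words, with the unified wrapper it would just be something like the
below (a sketch only; field and variable names here are illustrative, not
taken from the series):

	struct tdx_module_args args = {
		.r10 = 0,		/* 0 == standard TDVMCALL */
		.r11 = hcall_nr,	/* hypercall leaf number */
		/* .r12-.r15: hypercall arguments */
	};

	ret = tdcall(0, &args);		/* TDG.VP.VMCALL is TDCALL leaf 0 */
	/* for a TDVMCALL the leaf return code comes back in R10 */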


> >  *-------------------------------------------------------------------------
> >  * TDCALL/SEAMCALL ABI:
> >  *-------------------------------------------------------------------------
> >  * Input Registers:
> >  *
> >  * RAX                 - Leaf number.
> >  * RCX,RDX,R8-R11      - Leaf specific input registers.
> >  * RDI,RSI,RBX,R11-R15 - VP.VMCALL VP.ENTER
> >  *
> >  * Output Registers:
> >  *
> >  * RAX                 - instruction error code.
> >  * RCX,RDX,R8-R11      - Leaf specific output registers.
> >  * RDI,RSI,RBX,R12-R15 - VP.VMCALL VP.ENTER
> 
> As mentioned above, VP.VMCALL is handled by __tdx_hypercall().  Also, VP.ENTER
> will be handled by KVM's own assembly.  They both are not handled in this
> TDX_MODULE_CALL assembly.

I don't think they should be special, they're really just yet another
leaf call. Yes, they have a shit calling convention, and yes VP.ENTER is
terminally broken for unconditionally clobbering BP :-(

That really *must* be fixed.

> > .Lcall:
> > .if \host
> > 	seamcall
> > 	/*
> > 	 * SEAMCALL instruction is essentially a VMExit from VMX root
> > 	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
> > 	 * that the targeted SEAM firmware is not loaded or disabled,
> > 	 * or P-SEAMLDR is busy with another SEAMCALL. RAX is not
> > 	 * changed in this case.
> > 	 */
> > 	jc	.Lseamfail
> > 
> > .if \saved && \ret
> > 	/*
> > 	 * VP.ENTER clears RSI on output, use it to restore state.
> > 	 */
> > 	popq	%rsi
> > 	xor	%edi,%edi
> > 	movq	%rdi, TDX_MODULE_rdi(%rsi)
> > 	movq	%rdi, TDX_MODULE_rsi(%rsi)
> > .endif
> > .else
> > 	tdcall
> > 
> > 	/*
> > 	 * RAX!=0 indicates a failure, assume no return values.
> > 	 */
> > 	testq	%rax, %rax
> > 	jne	.Lerror
> 
> For some SEAMCALLs/TDCALLs the output registers may contain additional error
> information.  We need to jump to a location where whether those additional
> regs are returned in 'struct tdx_module_args' depends on \ret.

I suppose we can move this into the below conditional :-( The [DS]I
register stuff requires a scratch reg to recover, AX being zero provides
that.

> > .if \saved && \ret
> > 	/*
> > 	 * Since RAX==0, it can be used as a scratch register to restore state.
> > 	 *
> > 	 * [ assumes \saved implies \ret ]
> > 	 */
> > 	popq	%rax
> > 	movq	%rdi, TDX_MODULE_rdi(%rax)
> > 	movq	%rsi, TDX_MODULE_rsi(%rax)
> > 	movq	%rax, %rsi
> > 	xor	%eax, %eax;
> > .endif
> > .endif // \host

So the reason I want this, is that I feel very strongly that if you
cannot write a single coherent wrapper for all this, its calling
convention is fundamentally *too* complex / broken.



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-07-05 10:21                 ` Peter Zijlstra
@ 2023-07-05 11:34                   ` Huang, Kai
  2023-07-05 12:19                     ` Peter Zijlstra
  2023-07-05 12:21                     ` Peter Zijlstra
  0 siblings, 2 replies; 159+ messages in thread
From: Huang, Kai @ 2023-07-05 11:34 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen, Dave, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, linux-mm, linux-kernel, tglx, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Wed, 2023-07-05 at 12:21 +0200, Peter Zijlstra wrote:
> On Mon, Jul 03, 2023 at 12:15:13PM +0000, Huang, Kai wrote:
> > 
> > > 
> > > So I think the below deals with everything and unifies __tdx_hypercall()
> > > and __tdx_module_call(), since both sides need to deal with exactly the
> > > same trainwreck.
> > 
> > Hi Peter,
> > 
> > Just want to make sure I understand you correctly:
> > 
> > You want to make __tdx_module_call() look like __tdx_hypercall(), but not to
> > unify them into one assembly (at least for now), right?
> 
> Well, given the horrendous trainwreck this is all turning into, I
> thought it prudent to have it all in a single place. The moment you go
> play games with callee-saved registers you're really close to what
> hypercall does so then they might as well be the same.

OK I understand you now.  Thanks.

Yeah, I think from a long-term view, since the SEAMCALLs to support live
migration pretty much use all of RCX/RDX/R8-R15 as input/output, it seems
reasonable to unify all of them, although I guess there might be some special
handling for VP.VMCALL and/or VP.ENTER, e.g., below:

        /* TDVMCALL leaf return code is in R10 */                              
        movq %r10, %rax

So in the long term, I don't have an objection to that.  But my thinking is
that for the first version of TDX host support we don't have to support all
SEAMCALLs, only those involved in basic TDX support.  Those SEAMCALLs (except
VP.ENTER) only use RCX/RDX/R8/R9 as input and RCX/RDX/R8-R11 as output, so to
me it looks fine to only make __tdx_module_call() look like __tdx_hypercall()
as the first step.

Also, the new SEAMCALLs to handle live migration all seem to have the below
statement:

	AVX, AVX2 and AVX512 state	May be reset to the architectural INIT state

Which means those SEAMCALLs need to preserve AVX* states too?

And reading the spec, VP.VMCALL and VP.ENTER can also use XMM0 - XMM15 as
input/output.  Linux's VP.VMCALL doesn't seem to support using XMM0 - XMM15 as
input/output, but KVM can run other guest OSes too, so I think KVM's VP.ENTER
needs to handle XMM0 - XMM15 as input/output too.

That being said, although we can provide a common asm macro to cover
VP.ENTER, I suspect KVM still needs to do additional assembly around the
macro too.  So I am not sure whether we should try to cover VP.ENTER.

And I don't want to speak for KVM maintainers. :)

Hi Sean/Paolo, do you have any comments here?

> 
> > I am confused you mentioned VP.VMCALL below, which is handled by
> > __tdx_hypercall().
> 
> But why? It really isn't *that* special if you consider the other calls
> that are using callee-saved regs, yes it has the rdi/rsi extra, but meh,
> it really just is tdcall-0.

As mentioned above I don't have objection to this :)

> 
> 
> > >  *-------------------------------------------------------------------------
> > >  * TDCALL/SEAMCALL ABI:
> > >  *-------------------------------------------------------------------------
> > >  * Input Registers:
> > >  *
> > >  * RAX                 - Leaf number.
> > >  * RCX,RDX,R8-R11      - Leaf specific input registers.
> > >  * RDI,RSI,RBX,R11-R15 - VP.VMCALL VP.ENTER
> > >  *
> > >  * Output Registers:
> > >  *
> > >  * RAX                 - instruction error code.
> > >  * RCX,RDX,R8-R11      - Leaf specific output registers.
> > >  * RDI,RSI,RBX,R12-R15 - VP.VMCALL VP.ENTER
> > 
> > As mentioned above, VP.VMCALL is handled by __tdx_hypercall().  Also, VP.ENTER
> > will be handled by KVM's own assembly.  They both are not handled in this
> > TDX_MODULE_CALL assembly.
> 
> I don't think they should be special, they're really just yet another
> leaf call. Yes, they have a shit calling convention, and yes VP.ENTER is
> terminally broken for unconditionally clobbering BP :-(
> 
> That really *must* be fixed.

Sure I don't have objection to this, and for VP.ENTER please see above.

But I'd like to say that, generally speaking, from virtualization's point of
view, the guest has its own BP and conceptually the hypervisor needs to restore
the guest's BP before jumping to the guest.  E.g., for a normal VMX guest, KVM
always restores the guest's BP before VMENTER (arch/x86/kvm/vmx/vmenter.S):

SYM_FUNC_START(__vmx_vcpu_run)
        push %_ASM_BP
        mov  %_ASM_SP, %_ASM_BP
	
	...
	mov VCPU_RBP(%_ASM_AX), %_ASM_BP
	...
	vmenter/vmresume
	...
SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)
	.....
	mov %_ASM_BP, VCPU_RBP(%_ASM_AX)
	...
	pop %_ASM_BP
        RET

> 
> > > .Lcall:
> > > .if \host
> > > 	seamcall
> > > 	/*
> > > 	 * SEAMCALL instruction is essentially a VMExit from VMX root
> > > 	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
> > > 	 * that the targeted SEAM firmware is not loaded or disabled,
> > > 	 * or P-SEAMLDR is busy with another SEAMCALL. RAX is not
> > > 	 * changed in this case.
> > > 	 */
> > > 	jc	.Lseamfail
> > > 
> > > .if \saved && \ret
> > > 	/*
> > > 	 * VP.ENTER clears RSI on output, use it to restore state.
> > > 	 */
> > > 	popq	%rsi
> > > 	xor	%edi,%edi
> > > 	movq	%rdi, TDX_MODULE_rdi(%rsi)
> > > 	movq	%rdi, TDX_MODULE_rsi(%rsi)
> > > .endif
> > > .else
> > > 	tdcall
> > > 
> > > 	/*
> > > 	 * RAX!=0 indicates a failure, assume no return values.
> > > 	 */
> > > 	testq	%rax, %rax
> > > 	jne	.Lerror
> > 
> > For some SEAMCALLs/TDCALLs the output registers may contain additional error
> > information.  We need to jump to a location where whether those additional
> > regs are returned in 'struct tdx_module_args' depends on \ret.
> 
> I suppose we can move this into the below conditional :-( The [DS]I
> register stuff requires a scratch reg to recover, AX being zero provides
> that.

Yeah this can certainly be done in one way or another.

> 
> > > .if \saved && \ret
> > > 	/*
> > > 	 * Since RAX==0, it can be used as a scratch register to restore state.
> > > 	 *
> > > 	 * [ assumes \saved implies \ret ]
> > > 	 */
> > > 	popq	%rax
> > > 	movq	%rdi, TDX_MODULE_rdi(%rax)
> > > 	movq	%rsi, TDX_MODULE_rsi(%rax)
> > > 	movq	%rax, %rsi
> > > 	xor	%eax, %eax;
> > > .endif
> > > .endif // \host
> 
> So the reason I want this, is that I feel very strongly that if you
> cannot write a single coherent wrapper for all this, its calling
> convention is fundamentally *too* complex / broken.

In general I agree, but I am not sure whether there's any detail holding us
back.  :)


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-07-05 11:34                   ` Huang, Kai
@ 2023-07-05 12:19                     ` Peter Zijlstra
  2023-07-05 12:53                       ` Huang, Kai
  2023-07-05 12:21                     ` Peter Zijlstra
  1 sibling, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-07-05 12:19 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen, Dave, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, linux-mm, linux-kernel, tglx, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Wed, Jul 05, 2023 at 11:34:53AM +0000, Huang, Kai wrote:

> Yeah, I think from a long-term view, since the SEAMCALLs to support live
> migration pretty much use all of RCX/RDX/R8-R15 as input/output, it seems
> reasonable to unify all of them, although I guess there might be some special
> handling for VP.VMCALL and/or VP.ENTER, e.g., below:
> 
>         /* TDVMCALL leaf return code is in R10 */                              
>         movq %r10, %rax
> 
> So in the long term, I don't have an objection to that.  But my thinking is
> that for the first version of TDX host support we don't have to support all
> SEAMCALLs, only those involved in basic TDX support.

Since those calls are out now, we should look at them now, there is no
point in delaying the pain. That then gives us two options:

 - we accept them and their wonky calling convention and our code should
   be ready for it.

 - we reject them and send the TDX team a message to please try again
   but with a saner calling convention.

Sticking our head in the sand and pretending like they don't exist isn't
really a viable option at this point.

> Also, the new SEAMCALLs to handle live migration all seem to have below
> statement:
> 
> 	AVX, AVX2 and AVX512 state	May be reset to the architectural INIT state
> 
> Which means those SEAMCALLs need to preserve AVX* states too?

Yes, we need to ensure the userspace 'FPU' state is saved before
we call them. But I _think_ that KVM already does much of that.

> And reading the spec, the VP.VMCALL and VP.ENTER also can use XMM0 - XMM15 as
> input/output.  Linux VP.VMCALL doesn't seem to support using XMM0 - XMM15 as
> input/output, but KVM can run other guest OSes too so I think KVM VP.ENTER needs
> to handle XMM0-XMM15 as input/output too.

Why would KVM accept VMCALLs it doesn't know about? Just trash the
guest and call it a day.

> That being said, I think although we can provide a common asm macro to cover
> VP.ENTER, I suspect KVM still needs to do additional assembly around the macro
> too.  So I am not sure whether we should try to cover VP.ENTER.

Not sure about asm, we have interfaces to save the XMM/AVX regs.
kernel_fpu_begin() comes to mind, but I know there's more of that,
including some for KVM specifically.
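
Something like the below around the offending calls, perhaps (a sketch only;
the wrapper name is made up, the seamcall() signature is the one used earlier
in this series, and kernel_fpu_begin()/kernel_fpu_end() save and restore the
current FPU register state while disabling preemption in between):

	#include <asm/fpu/api.h>

	/* For SEAMCALL leaves documented to clobber AVX* state */
	static int seamcall_fpu_protected(u64 fn, u64 rcx, u64 rdx,
					  u64 r8, u64 r9,
					  struct tdx_module_output *out)
	{
		int ret;

		kernel_fpu_begin();
		ret = seamcall(fn, rcx, rdx, r8, r9, NULL, out);
		kernel_fpu_end();

		return ret;
	}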

> > I don't think they should be special, they're really just yet another
> > leaf call. Yes, they have a shit calling convention, and yes VP.ENTER is
> > terminally broken for unconditionally clobbering BP :-(
> > 
> > That really *must* be fixed.
> 
> Sure I don't have objection to this, and for VP.ENTER please see above.
> 
> But I'd like to say that, generally speaking, from virtualization's point of
> view, guest has its own BP and conceptually the hypervisor needs to restore
> guest's BP before jumping to the guest.  E.g., for normal VMX guest, KVM always
> restores guest's BP before VMENTER (arch/x86/kvm/vmx/vmenter.S):
> 
> SYM_FUNC_START(__vmx_vcpu_run)
>         push %_ASM_BP
>         mov  %_ASM_SP, %_ASM_BP
> 	
> 	...
> 	mov VCPU_RBP(%_ASM_AX), %_ASM_BP
> 	...
> 	vmenter/vmresume
> 	...
> SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)
> 	.....
> 	mov %_ASM_BP, VCPU_RBP(%_ASM_AX)
> 	...
> 	pop %_ASM_BP
>         RET

That's disgusting :/ So what happens if we get an NMI after VMENTER and
before POP? Then it sees a garbage BP value.

Why is all this stuff such utter crap?


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-07-05 11:34                   ` Huang, Kai
  2023-07-05 12:19                     ` Peter Zijlstra
@ 2023-07-05 12:21                     ` Peter Zijlstra
  1 sibling, 0 replies; 159+ messages in thread
From: Peter Zijlstra @ 2023-07-05 12:21 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen, Dave, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, linux-mm, linux-kernel, tglx, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Wed, Jul 05, 2023 at 11:34:53AM +0000, Huang, Kai wrote:

> Yeah, I think from a long-term view, since the SEAMCALLs to support live
> migration pretty much use all of RCX/RDX/R8-R15 as input/output, it seems
> reasonable to unify all of them, although I guess there might be some special
> handling for VP.VMCALL and/or VP.ENTER, e.g., below:
> 
>         /* TDVMCALL leaf return code is in R10 */                              
>         movq %r10, %rax

Well, that's a problem for whoever does tdcall(0, &args), no?

But did I say it had a crap calling convention already?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-07-05 12:19                     ` Peter Zijlstra
@ 2023-07-05 12:53                       ` Huang, Kai
  2023-07-05 20:56                         ` Isaku Yamahata
  0 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-07-05 12:53 UTC (permalink / raw)
  To: peterz
  Cc: kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen, Dave, ak,
	Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, tglx, linux-kernel, linux-mm, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Wed, 2023-07-05 at 14:19 +0200, Peter Zijlstra wrote:
> On Wed, Jul 05, 2023 at 11:34:53AM +0000, Huang, Kai wrote:
> 
> > Yeah, I think from a long-term view, since the SEAMCALLs to support live
> > migration pretty much use all of RCX/RDX/R8-R15 as input/output, it seems
> > reasonable to unify all of them, although I guess there might be some special
> > handling for VP.VMCALL and/or VP.ENTER, e.g., below:
> > 
> >         /* TDVMCALL leaf return code is in R10 */                              
> >         movq %r10, %rax
> > 
> > So in the long term, I don't have an objection to that.  But my thinking is
> > that for the first version of TDX host support we don't have to support all
> > SEAMCALLs, only those involved in basic TDX support.
> 
> Since those calls are out now, we should look at them now, there is no
> point in delaying the pain. That then gives us two options:
> 
>  - we accept them and their wonky calling convention and our code should
>    be ready for it.
> 
>  - we reject them and send the TDX team a message to please try again
>    but with a saner calling convention.
> 
> Sticking our head in the sand and pretending like they don't exist isn't
> really a viable option at this point.

OK.  I'll work on this.

But I think even if we want to unify __tdx_module_call() and __tdx_hypercall(),
the first step should be making __tdx_module_call() look like __tdx_hypercall().
I mean, from the point of view of organizing the patchset, we cannot just do it
in one big patch but need to split it into small patches, each doing one thing.

My thinking is perhaps we can organize it this way:

 1) Patch(es) to make TDX_MODULE_CALL macro / __tdx_module_call() look like
__tdx_hypercall().
 2) Add SEAMCALL support based on TDX_MODULE_CALL, e.g., implement __seamcall().
 3) Unify __tdx_module_call()/__seamcall() with __tdx_hypercall().

Does this look good?

Btw, I've already done part 1) based on your code, and sent the patches to Kirill
for review.  Should I send them out first?

> 
> > Also, the new SEAMCALLs to handle live migration all seem to have below
> > statement:
> > 
> > 	AVX, AVX2 and AVX512 state	May be reset to the architectural INIT state
> > 
> > Which means those SEAMCALLs need to preserve AVX* states too?
> 
> Yes, we need to ensure the userspace 'FPU' state is saved before
> we call them. But I _think_ that KVM already does much of that.

Let me look into this.

> 
> > And reading the spec, the VP.VMCALL and VP.ENTER also can use XMM0 - XMM15 as
> > input/output.  Linux VP.VMCALL doesn't seem to support using XMM0 - XMM15 as
> > input/output, but KVM can run other guest OSes too so I think KVM VP.ENTER needs
> > to handle XMM0-XMM15 as input/output too.
> 
> Why would KVM accept VMCALLs it doesn't know about? Just trash the
> guest and call it a day.
> 
> > That being said, I think although we can provide a common asm macro to cover
> > VP.ENTER, I suspect KVM still needs to do additional assembly around the macro
> > too.  So I am not sure whether we should try to cover VP.ENTER.
> 
> Not sure about asm, we have interfaces to save the XMM/AVX regs.
> kernel_fpu_begin() comes to mind, but I know there's more of that,
> including some for KVM specifically.

Yeah, it doesn't have to be asm if it can be done in C.

> 
> > > I don't think they should be special, they're really just yet another
> > > leaf call. Yes, they have a shit calling convention, and yes VP.ENTER is
> > > terminally broken for unconditionally clobbering BP :-(
> > > 
> > > That really *must* be fixed.
> > 
> > Sure I don't have objection to this, and for VP.ENTER please see above.
> > 
> > But I'd like to say that, generally speaking, from virtualization's point of
> > view, guest has its own BP and conceptually the hypervisor needs to restore
> > guest's BP before jumping to the guest.  E.g., for normal VMX guest, KVM always
> > restores guest's BP before VMENTER (arch/x86/kvm/vmx/vmenter.S):
> > 
> > SYM_FUNC_START(__vmx_vcpu_run)
> >         push %_ASM_BP
> >         mov  %_ASM_SP, %_ASM_BP
> > 	
> > 	...
> > 	mov VCPU_RBP(%_ASM_AX), %_ASM_BP
> > 	...
> > 	vmenter/vmresume
> > 	...
> > SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)
> > 	.....
> > 	mov %_ASM_BP, VCPU_RBP(%_ASM_AX)
> > 	...
> > 	pop %_ASM_BP
> >         RET
> 
> That's disgusting :/ So what happens if we get an NMI after VMENTER and
> before POP? Then it sees a garbage BP value.

Looks so.

> 
> Why is all this stuff such utter crap?
> 

The problem is KVM has to save/restore BP for the guest, because VMX hardware
doesn't save/restore BP during VMENTER/VMEXIT.  I am not sure whether there's a
better way to handle it.

My brain is getting slow right now as it's 1-hour past midnight already.  I am
hoping Paolo/Sean can jump in here. :)

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-07-04 16:58                 ` Peter Zijlstra
  2023-07-04 21:50                   ` Huang, Kai
@ 2023-07-05 14:34                   ` Dave Hansen
  2023-07-05 14:57                     ` Peter Zijlstra
  1 sibling, 1 reply; 159+ messages in thread
From: Dave Hansen @ 2023-07-05 14:34 UTC (permalink / raw)
  To: Peter Zijlstra, Sean Christopherson
  Cc: Isaku Yamahata, Kai Huang, kvm, Ashok Raj, Tony Luck, david,
	bagasdotme, ak, Rafael J Wysocki, kirill.shutemov,
	Reinette Chatre, pbonzini, mingo, tglx, linux-kernel, linux-mm,
	Isaku Yamahata, nik.borisov, hpa, Sagi Shahar, imammedo, bp,
	Chao Gao, Len Brown, sathyanarayanan.kuppuswamy, Ying Huang,
	Dan J Williams, x86

On 7/4/23 09:58, Peter Zijlstra wrote:
> If we have concerns about allocating the PAMT array, can't we use CMA
> for this? Allocate the whole thing at boot as CMA such that when not
> used for TDX it can be used for regular things like userspace and
> filecache pages?

I never thought of CMA as being super reliable.  Maybe it's improved
over the years.

KVM also has a rather nasty habit of pinning pages, like for device
passthrough.  I suspect that means that we'll have one of two scenarios:

 1. CMA works great, but the TDX/CMA area is unusable for KVM because
    it's pinning all its pages and they just get moved out of the CMA
    area immediately.  The CMA area is effectively wasted.
 2. CMA sucks, and users get sporadic TDX failures when they wait a long
    time to run a TDX guest after boot.  Users just work around the CMA
    support by starting up TDX guests at boot or demanding a module
    parameter be set.  Hacking in CMA support was a waste.

Am I just too much of a pessimist?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-07-05 14:34                   ` Dave Hansen
@ 2023-07-05 14:57                     ` Peter Zijlstra
  2023-07-06 14:49                       ` Dave Hansen
  0 siblings, 1 reply; 159+ messages in thread
From: Peter Zijlstra @ 2023-07-05 14:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Sean Christopherson, Isaku Yamahata, Kai Huang, kvm, Ashok Raj,
	Tony Luck, david, bagasdotme, ak, Rafael J Wysocki,
	kirill.shutemov, Reinette Chatre, pbonzini, mingo, tglx,
	linux-kernel, linux-mm, Isaku Yamahata, nik.borisov, hpa,
	Sagi Shahar, imammedo, bp, Chao Gao, Len Brown,
	sathyanarayanan.kuppuswamy, Ying Huang, Dan J Williams, x86

On Wed, Jul 05, 2023 at 07:34:06AM -0700, Dave Hansen wrote:
> On 7/4/23 09:58, Peter Zijlstra wrote:
> > If we have concerns about allocating the PAMT array, can't we use CMA
> > for this? Allocate the whole thing at boot as CMA such that when not
> > used for TDX it can be used for regular things like userspace and
> > filecache pages?
> 
> I never thought of CMA as being super reliable.  Maybe it's improved
> over the years.
> 
> KVM also has a rather nasty habit of pinning pages, like for device
> passthrough.  I suspect that means that we'll have one of two scenarios:
> 
>  1. CMA works great, but the TDX/CMA area is unusable for KVM because
>     it's pinning all its pages and they just get moved out of the CMA
>     area immediately.  The CMA area is effectively wasted.
>  2. CMA sucks, and users get sporadic TDX failures when they wait a long
>     time to run a TDX guest after boot.  Users just work around the CMA
>     support by starting up TDX guests at boot or demanding a module
>     parameter be set.  Hacking in CMA support was a waste.
> 
> Am I just too much of a pessimist?

Well, if CMA still sucks, then that needs fixing. If CMA works, but we
have a circular fail in that KVM needs to long-term pin the PAMT pages
but long-term pin is evicted from CMA (the whole point of long-term pin,
after all), then surely we can break that cycle somehow, since in this
case the purpose of the CMA is being able to grab that memory chunk when
we need it.

That is, either way around is just a matter of a little code, no?

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP
  2023-07-05 12:53                       ` Huang, Kai
@ 2023-07-05 20:56                         ` Isaku Yamahata
  0 siblings, 0 replies; 159+ messages in thread
From: Isaku Yamahata @ 2023-07-05 20:56 UTC (permalink / raw)
  To: Huang, Kai
  Cc: peterz, kvm, Raj, Ashok, Luck, Tony, david, bagasdotme, Hansen,
	Dave, ak, Wysocki, Rafael J, kirill.shutemov, Chatre, Reinette,
	Christopherson,,
	Sean, pbonzini, mingo, tglx, linux-kernel, linux-mm, Yamahata,
	Isaku, nik.borisov, hpa, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86, isaku.yamahata

On Wed, Jul 05, 2023 at 12:53:58PM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Wed, 2023-07-05 at 14:19 +0200, Peter Zijlstra wrote:
> > On Wed, Jul 05, 2023 at 11:34:53AM +0000, Huang, Kai wrote:
> > 
> > > Yeah, I think from a long-term view, since the SEAMCALLs to support live
> > > migration pretty much use all of RCX/RDX/R8-R15 as input/output, it seems
> > > reasonable to unify all of them, although I guess there might be some special
> > > handling for VP.VMCALL and/or VP.ENTER, e.g., below:
> > > 
> > >         /* TDVMCALL leaf return code is in R10 */                              
> > >         movq %r10, %rax
> > > 
> > > So in the long term, I don't have an objection to that.  But my thinking is
> > > that for the first version of TDX host support we don't have to support all
> > > SEAMCALLs, only those involved in basic TDX support.
> > 
> > Since those calls are out now, we should look at them now, there is no
> > point in delaying the pain. That then gives us two options:
> > 
> >  - we accept them and their wonky calling convention and our code should
> >    be ready for it.
> > 
> >  - we reject them and send the TDX team a message to please try again
> >    but with a saner calling convention.
> > 
> > Sticking our head in the sand and pretending like they don't exist isn't
> > really a viable option at this point.
> 
> OK.  I'll work on this.
> 
> But I think even if we want to unify __tdx_module_call() and __tdx_hypercall(),
> the first step should be making __tdx_module_call() look like __tdx_hypercall().
> I mean, from the point of view of organizing the patchset, we cannot just do it
> in one big patch but need to split it into small patches, each doing one thing.
> 
> My thinking is perhaps we can organize it this way:
> 
>  1) Patch(es) to make TDX_MODULE_CALL macro / __tdx_module_call() look like
> __tdx_hypercall().
>  2) Add SEAMCALL support based on TDX_MODULE_CALL, e.g., implement __seamcall().
>  3) Unify __tdx_module_call()/__seamcall() with __tdx_hypercall().
> 
> Does this look good?
> 
> Btw, I've already done part 1) based on your code, and sent the patches to Kirill
> for review.  Should I send them out first?
> 
> > 
> > > Also, the new SEAMCALLs to handle live migration all seem to have below
> > > statement:
> > > 
> > > 	AVX, AVX2 and AVX512 state	May be reset to the architectural INIT state
> > > 
> > > Which means those SEAMCALLs need to preserve AVX* states too?
> > 
> > Yes, we need to ensure the userspace 'FPU' state is saved before
> > we call them. But I _think_ that KVM already does much of that.
> 
> Let me look into this.

The KVM VCPU_RUN ioctl saves/restores FPU state via kvm_load_guest_fpu() and
kvm_put_guest_fpu(), which call fpu_swap_kvm_fpstate().
Other KVM ioctls don't modify the FPU.  Because some SEAMCALLs related to live
migration don't preserve FPU state, we need an explicit save/restore of FPU state.

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 16/22] x86/virt/tdx: Initialize all TDMRs
  2023-06-26 14:12 ` [PATCH v12 16/22] x86/virt/tdx: Initialize all TDMRs Kai Huang
@ 2023-07-06  5:31   ` Yuan Yao
  0 siblings, 0 replies; 159+ messages in thread
From: Yuan Yao @ 2023-07-06  5:31 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, peterz, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:46AM +1200, Kai Huang wrote:
> After the global KeyID has been configured on all packages, initialize
> all TDMRs to make all TDX-usable memory regions that are passed to the
> TDX module become usable.
>
> This is the last step of initializing the TDX module.
>
> Initializing TDMRs can be time consuming on large memory systems as it
> involves initializing all metadata entries for all pages that can be
> used by TDX guests.  Initializing different TDMRs can be parallelized.
> For now to keep it simple, just initialize all TDMRs one by one.  It can
> be enhanced in the future.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>

>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>
> v11 -> v12:
>  - Added Kirill's tag
>
> v10 -> v11:
>  - No update
>
> v9 -> v10:
>  - Code change due to changing static 'tdx_tdmr_list' to local 'tdmr_list'.
>
> v8 -> v9:
>  - Improved changelog to explain why initializing TDMRs can take a long
>    time (Dave).
>  - Improved comments around 'next-to-initialize' address (Dave).
>
> v7 -> v8: (Dave)
>  - Changelog:
>    - explicitly call out this is the last step of TDX module initialization.
>    - Trimed down changelog by removing SEAMCALL name and details.
>  - Removed/trimmed down unnecessary comments.
>  - Other changes due to 'struct tdmr_info_list'.
>
> v6 -> v7:
>  - Removed need_resched() check. -- Andi.
>
>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 60 ++++++++++++++++++++++++++++++++-----
>  arch/x86/virt/vmx/tdx/tdx.h |  1 +
>  2 files changed, 53 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index f5d4dbc11aee..52b7267ea226 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -994,6 +994,56 @@ static int config_global_keyid(void)
>  	return ret;
>  }
>
> +static int init_tdmr(struct tdmr_info *tdmr)
> +{
> +	u64 next;
> +
> +	/*
> +	 * Initializing a TDMR can be time consuming.  To avoid long
> +	 * SEAMCALLs, the TDX module may only initialize a part of the
> +	 * TDMR in each call.
> +	 */
> +	do {
> +		struct tdx_module_output out;
> +		int ret;
> +
> +		/* All 0's are unused parameters, they mean nothing. */
> +		ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL,
> +				&out);
> +		if (ret)
> +			return ret;
> +		/*
> +		 * RDX contains 'next-to-initialize' address if
> +		 * TDH.SYS.TDMR.INIT did not fully complete and
> +		 * should be retried.
> +		 */
> +		next = out.rdx;
> +		cond_resched();
> +		/* Keep making SEAMCALLs until the TDMR is done */
> +	} while (next < tdmr->base + tdmr->size);
> +
> +	return 0;
> +}
> +
> +static int init_tdmrs(struct tdmr_info_list *tdmr_list)
> +{
> +	int i;
> +
> +	/*
> +	 * This operation is costly.  It can be parallelized,
> +	 * but keep it simple for now.
> +	 */
> +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> +		int ret;
> +
> +		ret = init_tdmr(tdmr_entry(tdmr_list, i));
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
>  static int init_tdx_module(void)
>  {
>  	struct tdsysinfo_struct *sysinfo;
> @@ -1067,14 +1117,8 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out_reset_pamts;
>
> -	/*
> -	 * TODO:
> -	 *
> -	 *  - Initialize all TDMRs.
> -	 *
> -	 *  Return error before all steps are done.
> -	 */
> -	ret = -EINVAL;
> +	/* Initialize TDMRs to complete the TDX module initialization */
> +	ret = init_tdmrs(&tdmr_list);
>  out_reset_pamts:
>  	if (ret) {
>  		/*
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index a0438513bec0..f6b4e153890d 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -25,6 +25,7 @@
>  #define TDH_SYS_INFO		32
>  #define TDH_SYS_INIT		33
>  #define TDH_SYS_LP_INIT		35
> +#define TDH_SYS_TDMR_INIT	36
>  #define TDH_SYS_CONFIG		45
>
>  struct cmr_info {
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-07-05 14:57                     ` Peter Zijlstra
@ 2023-07-06 14:49                       ` Dave Hansen
  2023-07-10 17:58                         ` Sean Christopherson
  0 siblings, 1 reply; 159+ messages in thread
From: Dave Hansen @ 2023-07-06 14:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sean Christopherson, Isaku Yamahata, Kai Huang, kvm, Ashok Raj,
	Tony Luck, david, bagasdotme, ak, Rafael J Wysocki,
	kirill.shutemov, Reinette Chatre, pbonzini, mingo, tglx,
	linux-kernel, linux-mm, Isaku Yamahata, nik.borisov, hpa,
	Sagi Shahar, imammedo, bp, Chao Gao, Len Brown,
	sathyanarayanan.kuppuswamy, Ying Huang, Dan J Williams, x86

On 7/5/23 07:57, Peter Zijlstra wrote:
> On Wed, Jul 05, 2023 at 07:34:06AM -0700, Dave Hansen wrote:
>> On 7/4/23 09:58, Peter Zijlstra wrote:
>>> If we have concerns about allocating the PAMT array, can't we use CMA
>>> for this? Allocate the whole thing at boot as CMA such that when not
>>> used for TDX it can be used for regular things like userspace and
>>> filecache pages?
>> I never thought of CMA as being super reliable.  Maybe it's improved
>> over the years.
>>
>> KVM also has a rather nasty habit of pinning pages, like for device
>> passthrough.  I suspect that means that we'll have one of two scenarios:
>>
>>  1. CMA works great, but the TDX/CMA area is unusable for KVM because
>>     it's pinning all its pages and they just get moved out of the CMA
>>     area immediately.  The CMA area is effectively wasted.
>>  2. CMA sucks, and users get sporadic TDX failures when they wait a long
>>     time to run a TDX guest after boot.  Users just work around the CMA
>>     support by starting up TDX guests at boot or demanding a module
>>     parameter be set.  Hacking in CMA support was a waste.
>>
>> Am I just too much of a pessimist?
> Well, if CMA still sucks, then that needs fixing. If CMA works, but we
> have a circular fail in that KVM needs to long-term pin the PAMT pages
> but long-term pin is evicted from CMA (the whole point of long-term pin,
> after all), then surely we can break that cycle somehow, since in this
> case the purpose of the CMA is being able to grab that memory chunk when
> we need it.
> 
> That is, either way around is just a matter of a little code, no?

It's not a circular dependency, it's conflicting requirements.

CMA makes memory more available, but only in the face of unpinned pages.

KVM can pin lots of pages, even outside of TDX-based VMs.

So we either need to change how CMA works fundamentally or stop KVM from
pinning pages.



^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum
  2023-06-26 14:12 ` [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum Kai Huang
  2023-06-28  9:20   ` Nikolay Borisov
  2023-06-28 12:29   ` kirill.shutemov
@ 2023-07-07  4:01   ` Yuan Yao
  2 siblings, 0 replies; 159+ messages in thread
From: Yuan Yao @ 2023-07-07  4:01 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, peterz, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:49AM +1200, Kai Huang wrote:
> The first few generations of TDX hardware have an erratum.  A partial
> write to a TDX private memory cacheline will silently "poison" the
> line.  Subsequent reads will consume the poison and generate a machine
> check.  According to the TDX hardware spec, neither of these things
> should have happened.
>
> == Background ==
>
> Virtually all kernel memory accesses operations happen in full
> cachelines.  In practice, writing a "byte" of memory usually reads a 64
> byte cacheline of memory, modifies it, then writes the whole line back.
> Those operations do not trigger this problem.
>
> This problem is triggered by "partial" writes where a write transaction
> of less than cacheline lands at the memory controller.  The CPU does
> these via non-temporal write instructions (like MOVNTI), or through
> UC/WC memory mappings.  The issue can also be triggered away from the
> CPU by devices doing partial writes via DMA.
>
> == Problem ==
>
> A fast warm reset doesn't reset TDX private memory.  Kexec() can also
> boot into the new kernel directly.  Thus if the old kernel has enabled
> TDX on a platform with this erratum, the new kernel may get an unexpected
> machine check.
>
> Note that w/o this erratum any kernel read/write on TDX private memory
> should never cause a machine check, thus it's OK for the old kernel to
> leave TDX private pages as is.
>
> == Solution ==
>
> In short, with this erratum, the kernel needs to explicitly convert all
> TDX private pages back to normal to give the new kernel a clean slate
> after kexec().  The BIOS is also expected to disable fast warm reset as
> a workaround to this erratum, thus this implementation doesn't try to
> reset TDX private memory for the reboot case in the kernel but depends on
> the BIOS to enable the workaround.
>
> For now TDX private memory can only be PAMT pages.  It would be ideal to
> cover all types of TDX private memory here (TDX guest private pages and
> Secure-EPT pages are yet to be implemented when TDX gets supported in
> KVM), but there's no existing infrastructure to track TDX private pages.
> It's not feasible to query the TDX module about page type either because
> VMX has already been stopped when KVM receives the reboot notifier.
>
> Another option is to blindly convert all memory pages.  But this may
> bring non-trivial latency to kexec() on large memory systems (especially
> when the number of TDX private pages is small).  Thus even with this
> temporary solution, eventually it's better for the kernel to only reset
> TDX private pages.  Also, it's problematic to convert all memory pages
> because not all pages are mapped as writable in the direct-mapping.  The
> kernel needs to switch to another page table which maps all pages as
> writable (e.g., the identical-mapping table for kexec(), or a new page
> table) to do so, but this looks overkill.
>
> Therefore, rather than doing something dramatic, only reset PAMT pages
> for now.  Do it in machine_kexec() to avoid additional overhead to the
> machine reboot/shutdown as the kernel depends on the BIOS to disable
> fast warm reset as a workaround for the reboot case.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>
> v11 -> v12:
>  - Changed comment/changelog to say kernel doesn't try to handle fast
>    warm reset but depends on BIOS to enable workaround (Kirill)
>  - Added a new tdx_may_has_private_mem to indicate system may have TDX
>    private memory and PAMTs/TDMRs are stable to access. (Dave).
>  - Use atomic_t for tdx_may_has_private_mem for built-in memory barrier
>    (Dave)
>  - Changed calling x86_platform.memory_shutdown() to calling
>    tdx_reset_memory() directly from machine_kexec() to avoid overhead to
>    normal reboot case.
>
> v10 -> v11:
>  - New patch
>
>
> ---
>  arch/x86/include/asm/tdx.h         |  2 +
>  arch/x86/kernel/machine_kexec_64.c |  9 ++++
>  arch/x86/virt/vmx/tdx/tdx.c        | 79 ++++++++++++++++++++++++++++++
>  3 files changed, 90 insertions(+)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 91416fd600cd..e95c9fbf52e4 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -100,10 +100,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>  bool platform_tdx_enabled(void);
>  int tdx_cpu_enable(void);
>  int tdx_enable(void);
> +void tdx_reset_memory(void);
>  #else	/* !CONFIG_INTEL_TDX_HOST */
>  static inline bool platform_tdx_enabled(void) { return false; }
>  static inline int tdx_cpu_enable(void) { return -ENODEV; }
>  static inline int tdx_enable(void)  { return -ENODEV; }
> +static inline void tdx_reset_memory(void) { }
>  #endif	/* CONFIG_INTEL_TDX_HOST */
>
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 1a3e2c05a8a5..232253bd7ccd 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -28,6 +28,7 @@
>  #include <asm/setup.h>
>  #include <asm/set_memory.h>
>  #include <asm/cpu.h>
> +#include <asm/tdx.h>
>
>  #ifdef CONFIG_ACPI
>  /*
> @@ -301,6 +302,14 @@ void machine_kexec(struct kimage *image)
>  	void *control_page;
>  	int save_ftrace_enabled;
>
> +	/*
> +	 * On the platform with "partial write machine check" erratum,
> +	 * all TDX private pages need to be converted back to normal
> +	 * before booting to the new kernel, otherwise the new kernel
> +	 * may get an unexpected machine check.
> +	 */
> +	tdx_reset_memory();
> +
>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)
>  		save_processor_state();
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 85b24b2e9417..1107f4227568 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -51,6 +51,8 @@ static LIST_HEAD(tdx_memlist);
>
>  static struct tdmr_info_list tdx_tdmr_list;
>
> +static atomic_t tdx_may_has_private_mem;
> +
>  /*
>   * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
>   * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> @@ -1113,6 +1115,17 @@ static int init_tdx_module(void)
>  	 */
>  	wbinvd_on_all_cpus();
>
> +	/*
> +	 * Starting from this point the system may have TDX private
> +	 * memory.  Make it globally visible so tdx_reset_memory() only
> +	 * reads TDMRs/PAMTs when they are stable.
> +	 *
> +	 * Note using atomic_inc_return() to provide the explicit memory
> +	 * ordering isn't mandatory here as the WBINVD above already

WBINVD is a serializing instruction: everything that happens before
it must be committed before the instruction finishes execution, but
it does not order the instructions that come after it.
(SDM Vol.3 9.3 Jun 2023)

I think the atomic operation used below is to make sure the
change to tdx_may_has_private_mem becomes visible immediately
to other LPs which read it, e.g. one running tdx_reset_memory().
atomic_inc() should be enough for this case because locked
instructions have a total order.
(SDM Vol.3 9.2.3.8 June 2023).

So per my understanding the key here is the atomic
operation's guarantee on the visibility of memory changes, not
the guarantee from WBINVD.  The comment should be changed if
this understanding is correct.
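
Something like below, maybe (just a sketch of the code/comment I
mean, not tested):

	/*
	 * Make the update to tdx_may_has_private_mem a locked RMW.
	 * Locked instructions have a total order across all LPs
	 * (SDM Vol.3 9.2.3.8), so tdx_reset_memory() running on
	 * another LP that reads a non-zero value is guaranteed to
	 * see stable TDMRs/PAMTs.
	 */
	atomic_inc(&tdx_may_has_private_mem);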

> +	 * does that.  Compiler barrier isn't needed here either.
> +	 */
> +	atomic_inc_return(&tdx_may_has_private_mem);
> +
>  	/* Config the key of global KeyID on all packages */
>  	ret = config_global_keyid();
>  	if (ret)
> @@ -1154,6 +1167,15 @@ static int init_tdx_module(void)
>  	 * as suggested by the TDX spec.
>  	 */
>  	tdmrs_reset_pamt_all(&tdx_tdmr_list);
> +	/*
> +	 * No more TDX private pages now, and PAMTs/TDMRs are
> +	 * going to be freed.  Make this globally visible so
> +	 * tdx_reset_memory() can read stable TDMRs/PAMTs.
> +	 *
> +	 * Note atomic_dec_return(), which is an atomic RMW with
> +	 * return value, always enforces the memory barrier.
> +	 */
> +	atomic_dec_return(&tdx_may_has_private_mem);
>  out_free_pamts:
>  	tdmrs_free_pamt_all(&tdx_tdmr_list);
>  out_free_tdmrs:
> @@ -1229,6 +1251,63 @@ int tdx_enable(void)
>  }
>  EXPORT_SYMBOL_GPL(tdx_enable);
>
> +/*
> + * Convert TDX private pages back to normal on platforms with
> + * "partial write machine check" erratum.
> + *
> + * Called from machine_kexec() before booting to the new kernel.
> + */
> +void tdx_reset_memory(void)
> +{
> +	if (!platform_tdx_enabled())
> +		return;
> +
> +	/*
> +	 * Kernel read/write to TDX private memory doesn't
> +	 * cause machine check on hardware w/o this erratum.
> +	 */
> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +		return;
> +
> +	/* Called from kexec() when only the rebooting cpu is alive */
> +	WARN_ON_ONCE(num_online_cpus() != 1);
> +
> +	if (!atomic_read(&tdx_may_has_private_mem))
> +		return;
> +
> +	/*
> +	 * Ideally it's better to cover all types of TDX private pages,
> +	 * but there's no existing infrastructure to tell whether a page
> +	 * is TDX private memory or not.  Using SEAMCALL to query TDX
> +	 * module isn't feasible either because: 1) VMX has been turned
> +	 * off by reaching here so SEAMCALL cannot be made; 2) Even if
> +	 * SEAMCALL could be made, the result from the TDX module may not be
> +	 * accurate (e.g., remote CPU can be stopped while the kernel
> +	 * is in the middle of reclaiming one TDX private page and doing
> +	 * MOVDIR64B).
> +	 *
> +	 * One solution could be just converting all memory pages, but
> +	 * this may bring non-trivial latency on large memory systems
> +	 * (especially when the number of TDX private pages is small).
> +	 * So even with this temporary solution, eventually the kernel
> +	 * should only convert TDX private pages.
> +	 *
> +	 * Also, not all pages are mapped as writable in direct mapping,
> +	 * thus it's problematic to do so.  It can be done by switching
> +	 * to the identity mapping table for kexec() or a new page table
> +	 * which maps all pages as writable, but the complexity looks
> +	 * overkill.
> +	 *
> +	 * Thus instead of doing something dramatic to convert all pages,
> +	 * only convert PAMTs as for now TDX private pages can only
> +	 * be PAMT pages.
> +	 *
> +	 * All other cpus are already dead.  TDMRs/PAMTs are stable when
> +	 * @tdx_may_has_private_mem reads true.
> +	 */
> +	tdmrs_reset_pamt_all(&tdx_tdmr_list);
> +}
> +
>  static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
>  					    u32 *nr_tdx_keyids)
>  {
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 21/22] x86/mce: Improve error log of kernel space TDX #MC due to erratum
  2023-06-26 14:12 ` [PATCH v12 21/22] x86/mce: Improve error log of kernel space TDX #MC due to erratum Kai Huang
  2023-06-28 12:38   ` kirill.shutemov
@ 2023-07-07  7:26   ` Yuan Yao
  1 sibling, 0 replies; 159+ messages in thread
From: Yuan Yao @ 2023-07-07  7:26 UTC (permalink / raw)
  To: Kai Huang
  Cc: linux-kernel, kvm, linux-mm, x86, dave.hansen, kirill.shutemov,
	tony.luck, peterz, tglx, bp, mingo, hpa, seanjc, pbonzini, david,
	dan.j.williams, rafael.j.wysocki, ashok.raj, reinette.chatre,
	len.brown, ak, isaku.yamahata, ying.huang, chao.gao,
	sathyanarayanan.kuppuswamy, nik.borisov, bagasdotme, sagis,
	imammedo

On Tue, Jun 27, 2023 at 02:12:51AM +1200, Kai Huang wrote:
> The first few generations of TDX hardware have an erratum.  Triggering
> it in Linux requires some kind of kernel bug involving relatively exotic
> memory writes to TDX private memory and will manifest via
> spurious-looking machine checks when reading the affected memory.
>
> == Background ==
>
> Virtually all kernel memory access operations happen in full
> cachelines.  In practice, writing a "byte" of memory usually reads a 64
> byte cacheline of memory, modifies it, then writes the whole line back.
> Those operations do not trigger this problem.
>
> This problem is triggered by "partial" writes where a write transaction
> of less than cacheline lands at the memory controller.  The CPU does
> these via non-temporal write instructions (like MOVNTI), or through
> UC/WC memory mappings.  The issue can also be triggered away from the
> CPU by devices doing partial writes via DMA.
>
> == Problem ==
>
> A partial write to a TDX private memory cacheline will silently "poison"
> the line.  Subsequent reads will consume the poison and generate a
> machine check.  According to the TDX hardware spec, neither of these
> things should have happened.
>
> To add insult to injury, the Linux machine check code will present
> these as a literal "Hardware error" when they were, in fact, a
> software-triggered issue.
>
> == Solution ==
>
> In the end, this issue is hard to trigger.  Rather than do something
> rash (and incomplete) like unmap TDX private memory from the direct map,
> improve the machine check handler.
>
> Currently, the #MC handler doesn't distinguish whether the memory is
> TDX private memory or not but just dump, for instance, below message:
>
>  [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
>  [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
>  	...
>  [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>  [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
>  [...] Kernel panic - not syncing: Fatal local machine check
>
> Which says "Hardware Error" and "Data load in unrecoverable area of
> kernel".
>
> Ideally, it's better for the log to say "software bug around TDX private
> memory" instead of "Hardware Error".  But in reality the real hardware
> memory error can happen, and sadly such software-triggered #MC cannot be
> distinguished from the real hardware error.  Also, the error message is
> used by userspace tool 'mcelog' to parse, so changing the output may
> break userspace.
>
> So keep the "Hardware Error".  The "Data load in unrecoverable area of
> kernel" is also helpful, so keep it too.
>
> Instead of modifying above error log, improve the error log by printing
> additional TDX related message to make the log like:
>
>   ...
>  [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
>  [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.
>
> Adding this additional message requires determination of whether the
> memory page is TDX private memory.  There is no existing infrastructure
> to do that.  Add an interface to query the TDX module to fill this gap.
>
> == Impact ==
>
> This issue requires some kind of kernel bug to trigger.
>
> TDX private memory should never be mapped UC/WC.  A partial write
> originating from these mappings would require *two* bugs, first mapping
> the wrong page, then writing the wrong memory.  It would also be
> detectable using traditional memory corruption techniques like
> DEBUG_PAGEALLOC.
>
> MOVNTI (and friends) could cause this issue with something like a simple
> buffer overrun or use-after-free on the direct map.  It should also be
> detectable with normal debug techniques.
>
> The one place where this might get nasty would be if the CPU read data
> then wrote back the same data.  That would trigger this problem but
> would not, for instance, set off mechanisms like slab redzoning because
> it doesn't actually corrupt data.
>
> With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
> TDX private memory would first need to be incorrectly mapped into the
> I/O space and then a later DMA to that mapping would actually cause the
> poisoning event.
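
Just to make the trigger concrete for myself, a hypothetical buggy
sequence would be something like (illustration only, 'p' wrongly
pointing into a TDX private page):

	/* 8-byte non-temporal store: a sub-cacheline partial write */
	asm volatile("movnti %1, %0" : "=m" (*(u64 *)p) : "r" (0ULL));

	val = *(u64 *)p;	/* read consumes the poison -> #MC */

i.e. the partial write itself is silent, only a later read blows up.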

Reviewed-by: Yuan Yao <yuan.yao@intel.com>

>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>
> v11 -> v12:
>  - Simplified #MC message (Dave/Kirill)
>  - Slightly improved some comments.
>
> v10 -> v11:
>  - New patch
>
>
> ---
>  arch/x86/include/asm/tdx.h     |   2 +
>  arch/x86/kernel/cpu/mce/core.c |  33 +++++++++++
>  arch/x86/virt/vmx/tdx/tdx.c    | 102 +++++++++++++++++++++++++++++++++
>  arch/x86/virt/vmx/tdx/tdx.h    |   5 ++
>  4 files changed, 142 insertions(+)
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 8d3f85bcccc1..a697b359d8c6 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -106,11 +106,13 @@ bool platform_tdx_enabled(void);
>  int tdx_cpu_enable(void);
>  int tdx_enable(void);
>  void tdx_reset_memory(void);
> +bool tdx_is_private_mem(unsigned long phys);
>  #else	/* !CONFIG_INTEL_TDX_HOST */
>  static inline bool platform_tdx_enabled(void) { return false; }
>  static inline int tdx_cpu_enable(void) { return -ENODEV; }
>  static inline int tdx_enable(void)  { return -ENODEV; }
>  static inline void tdx_reset_memory(void) { }
> +static inline bool tdx_is_private_mem(unsigned long phys) { return false; }
>  #endif	/* CONFIG_INTEL_TDX_HOST */
>
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 2eec60f50057..f71b649f4c82 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -52,6 +52,7 @@
>  #include <asm/mce.h>
>  #include <asm/msr.h>
>  #include <asm/reboot.h>
> +#include <asm/tdx.h>
>
>  #include "internal.h"
>
> @@ -228,11 +229,34 @@ static void wait_for_panic(void)
>  	panic("Panicing machine check CPU died");
>  }
>
> +static const char *mce_memory_info(struct mce *m)
> +{
> +	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
> +		return NULL;
> +
> +	/*
> +	 * Certain initial generations of TDX-capable CPUs have an
> +	 * erratum.  A kernel non-temporal partial write to TDX private
> +	 * memory poisons that memory, and a subsequent read of that
> +	 * memory triggers #MC.
> +	 *
> +	 * However such #MC caused by software cannot be distinguished
> +	 * from the real hardware #MC.  Just print additional message
> +	 * to show such #MC may be result of the CPU erratum.
> +	 */
> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +		return NULL;
> +
> +	return !tdx_is_private_mem(m->addr) ? NULL :
> +		"TDX private memory error. Possible kernel bug.";
> +}
> +
>  static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>  {
>  	struct llist_node *pending;
>  	struct mce_evt_llist *l;
>  	int apei_err = 0;
> +	const char *memmsg;
>
>  	/*
>  	 * Allow instrumentation around external facilities usage. Not that it
> @@ -283,6 +307,15 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>  	}
>  	if (exp)
>  		pr_emerg(HW_ERR "Machine check: %s\n", exp);
> +	/*
> +	 * On confidential computing platforms such as TDX, MCEs
> +	 * may occur due to incorrect access to confidential
> +	 * memory.  Print additional information for such errors.
> +	 */
> +	memmsg = mce_memory_info(final);
> +	if (memmsg)
> +		pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
> +
>  	if (!fake_panic) {
>  		if (panic_timeout == 0)
>  			panic_timeout = mca_cfg.panic_timeout;
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index eba7ff91206d..5f96c2d866e5 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1315,6 +1315,108 @@ void tdx_reset_memory(void)
>  	tdmrs_reset_pamt_all(&tdx_tdmr_list);
>  }
>
> +static bool is_pamt_page(unsigned long phys)
> +{
> +	struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
> +	int i;
> +
> +	/*
> +	 * This function is called from the #MC handler, and theoretically
> +	 * it could run in parallel with the TDX module initialization
> +	 * on other logical cpus.  But it's not OK to hold a mutex here
> +	 * so just blindly check module status to make sure PAMTs/TDMRs
> +	 * are stable to access.
> +	 *
> +	 * This may return inaccurate result in rare cases, e.g., when
> +	 * #MC happens on a PAMT page during module initialization, but
> +	 * this is fine as #MC handler doesn't need a 100% accurate
> +	 * result.
> +	 */
> +	if (tdx_module_status != TDX_MODULE_INITIALIZED)
> +		return false;
> +
> +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> +		unsigned long base, size;
> +
> +		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
> +
> +		if (phys >= base && phys < (base + size))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +/*
> + * Return whether the memory page at the given physical address is TDX
> + * private memory or not.  Called from #MC handler do_machine_check().
> + *
> + * Note this function may not return an accurate result in rare cases.
> + * This is fine as the #MC handler doesn't need a 100% accurate result,
> + * because it cannot distinguish #MC between software bug and real
> + * hardware error anyway.
> + */
> +bool tdx_is_private_mem(unsigned long phys)
> +{
> +	struct tdx_module_output out;
> +	u64 sret;
> +
> +	if (!platform_tdx_enabled())
> +		return false;
> +
> +	/* Get page type from the TDX module */
> +	sret = __seamcall(TDH_PHYMEM_PAGE_RDMD, phys & PAGE_MASK,
> +			0, 0, 0, &out);
> +	/*
> +	 * Handle the case that CPU isn't in VMX operation.
> +	 *
> +	 * KVM guarantees no VM is running (thus no TDX guest)
> +	 * when any online CPU isn't in VMX operation.
> +	 * This means there will be no TDX guest private memory
> +	 * and Secure-EPT pages.  However the TDX module may have
> +	 * been initialized and the memory page could be PAMT.
> +	 */
> +	if (sret == TDX_SEAMCALL_UD)
> +		return is_pamt_page(phys);
> +
> +	/*
> +	 * Any other failure means:
> +	 *
> +	 * 1) TDX module not loaded; or
> +	 * 2) Memory page isn't managed by the TDX module.
> +	 *
> +	 * In either case, the memory page cannot be a TDX
> +	 * private page.
> +	 */
> +	if (sret)
> +		return false;
> +
> +	/*
> +	 * SEAMCALL was successful -- read page type (via RCX):
> +	 *
> +	 *  - PT_NDA:	Page is not used by the TDX module
> +	 *  - PT_RSVD:	Reserved for Non-TDX use
> +	 *  - Others:	Page is used by the TDX module
> +	 *
> +	 * Note PAMT pages are marked as PT_RSVD but they are also TDX
> +	 * private memory.
> +	 *
> +	 * Note: Even if the page type is PT_NDA, the memory page could still
> +	 * be associated with TDX private KeyID if the kernel hasn't
> +	 * explicitly used MOVDIR64B to clear the page.  Assume KVM
> +	 * always does that after reclaiming any private page from TDX
> +	 * guests.
> +	 */
> +	switch (out.rcx) {
> +	case PT_NDA:
> +		return false;
> +	case PT_RSVD:
> +		return is_pamt_page(phys);
> +	default:
> +		return true;
> +	}
> +}
> +
>  static int __init record_keyid_partitioning(u32 *tdx_keyid_start,
>  					    u32 *nr_tdx_keyids)
>  {
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index f6b4e153890d..2fefd688924c 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -21,6 +21,7 @@
>  /*
>   * TDX module SEAMCALL leaf functions
>   */
> +#define TDH_PHYMEM_PAGE_RDMD	24
>  #define TDH_SYS_KEY_CONFIG	31
>  #define TDH_SYS_INFO		32
>  #define TDH_SYS_INIT		33
> @@ -28,6 +29,10 @@
>  #define TDH_SYS_TDMR_INIT	36
>  #define TDH_SYS_CONFIG		45
>
> +/* TDX page types */
> +#define	PT_NDA		0x0
> +#define	PT_RSVD		0x1
> +
>  struct cmr_info {
>  	u64	base;
>  	u64	size;
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand
  2023-07-06 14:49                       ` Dave Hansen
@ 2023-07-10 17:58                         ` Sean Christopherson
  0 siblings, 0 replies; 159+ messages in thread
From: Sean Christopherson @ 2023-07-10 17:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Isaku Yamahata, Kai Huang, kvm, Ashok Raj,
	Tony Luck, david, bagasdotme, ak, Rafael J Wysocki,
	kirill.shutemov, Reinette Chatre, pbonzini, mingo, tglx,
	linux-kernel, linux-mm, Isaku Yamahata, nik.borisov, hpa,
	Sagi Shahar, imammedo, bp, Chao Gao, Len Brown,
	sathyanarayanan.kuppuswamy, Ying Huang, Dan J Williams, x86

On Thu, Jul 06, 2023, Dave Hansen wrote:
> On 7/5/23 07:57, Peter Zijlstra wrote:
> > On Wed, Jul 05, 2023 at 07:34:06AM -0700, Dave Hansen wrote:
> >> On 7/4/23 09:58, Peter Zijlstra wrote:
> >>> If we have concerns about allocating the PAMT array, can't we use CMA
> >>> for this? Allocate the whole thing at boot as CMA such that when not
> >>> used for TDX it can be used for regular things like userspace and
> >>> filecache pages?
> >> I never thought of CMA as being super reliable.  Maybe it's improved
> >> over the years.
> >>
> >> KVM also has a rather nasty habit of pinning pages, like for device
> >> passthrough.  I suspect that means that we'll have one of two scenarios:
> >>
> >>  1. CMA works great, but the TDX/CMA area is unusable for KVM because
> >>     it's pinning all its pages and they just get moved out of the CMA
> >>     area immediately.  The CMA area is effectively wasted.
> >>  2. CMA sucks, and users get sporadic TDX failures when they wait a long
> >>     time to run a TDX guest after boot.  Users just work around the CMA
> >>     support by starting up TDX guests at boot or demanding a module
> >>     parameter be set.  Hacking in CMA support was a waste.
> >>
> >> Am I just too much of a pessimist?
> > Well, if CMA still sucks, then that needs fixing. If CMA works, but we
> > have a circular fail in that KVM needs to long-term pin the PAMT pages
> > but long-term pin is evicted from CMA (the whole point of long-term pin,
> > after all), then surely we can break that cycle somehow, since in this
> > case the purpose of the CMA is being able to grab that memory chunk when
> > we needs it.
> > 
> > That is, either way around is just a matter of a little code, no?
> 
> It's not a circular dependency, it's conflicting requirements.
> 
> CMA makes memory more available, but only in the face of unpinned pages.
> 
> KVM can pin lots of pages, even outside of TDX-based VMs.
> 
> So we either need to change how CMA works fundamentally or stop KVM from
> pinning pages.

Nit, I think you're conflating KVM with VFIO and/or IOMMU code.  Device passhthrough
does pin large chunks of memory, but KVM itself isn't involved or even aware of
the pins.

HugeTLB is another case where CMA will be effectively used to serve guest memory,
but again KVM isn't the thing doing the pinning.
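
FWIW, the CMA variant being discussed would look roughly like the
below (completely untested sketch; tdx_pamt_cma and the sizing are
made-up names/numbers for illustration):

	#include <linux/cma.h>
	#include <linux/memblock.h>

	static struct cma *tdx_pamt_cma;	/* hypothetical */

	/* called early in boot, while memblock is still live */
	static int __init tdx_pamt_cma_reserve(void)
	{
		/* PAMTs consume ~1/256th of the covered memory */
		phys_addr_t size = memblock_phys_mem_size() / 256;

		return cma_declare_contiguous(0, size, 0, PMD_SIZE, 0,
					      false, "tdx_pamt",
					      &tdx_pamt_cma);
	}

	/* at TDX module init, instead of alloc_contig_pages(): */
	static struct page *tdx_alloc_pamt(unsigned long npages)
	{
		return cma_alloc(tdx_pamt_cma, npages, 0, false);
	}

Whether long-term pins migrating out of (or into) the CMA area
defeats this is exactly the open question above.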

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 09/22] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-06-26 14:12 ` [PATCH v12 09/22] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
  2023-06-28 14:17   ` Peter Zijlstra
@ 2023-07-11 11:38   ` David Hildenbrand
  2023-07-11 12:27     ` Huang, Kai
  1 sibling, 1 reply; 159+ messages in thread
From: David Hildenbrand @ 2023-07-11 11:38 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo



[...]

> +/* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
> +static LIST_HEAD(tdx_memlist);
> +
>   /*
>    * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
>    * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> @@ -204,6 +214,79 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
>   	return 0;
>   }
>   
> +/*
> + * Add a memory region as a TDX memory block.  The caller must make sure
> + * all memory regions are added in address ascending order and don't
> + * overlap.
> + */
> +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> +			    unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> +	if (!tmb)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&tmb->list);
> +	tmb->start_pfn = start_pfn;
> +	tmb->end_pfn = end_pfn;
> +
> +	/* @tmb_list is protected by mem_hotplug_lock */

If the list is static and independent of memory hotplug, why does it 
have to be protected?

I assume because the memory notifier might currently trigger before 
building the list.

Not sure if that is the right approach. See below.

> +	list_add_tail(&tmb->list, tmb_list);
> +	return 0;
> +}
> +
> +static void free_tdx_memlist(struct list_head *tmb_list)
> +{
> +	/* @tmb_list is protected by mem_hotplug_lock */
> +	while (!list_empty(tmb_list)) {
> +		struct tdx_memblock *tmb = list_first_entry(tmb_list,
> +				struct tdx_memblock, list);
> +
> +		list_del(&tmb->list);
> +		kfree(tmb);
> +	}
> +}
> +
> +/*
> + * Ensure that all memblock memory regions are convertible to TDX
> + * memory.  Once this has been established, stash the memblock
> + * ranges off in a secondary structure because memblock is modified
> + * in memory hotplug while TDX memory regions are fixed.
> + */
> +static int build_tdx_memlist(struct list_head *tmb_list)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int i, ret;
> +
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> +		/*
> +		 * The first 1MB is not reported as TDX convertible memory.
> +		 * Although the first 1MB is always reserved and won't end up
> +		 * to the page allocator, it is still in memblock's memory
> +		 * regions.  Skip them manually to exclude them as TDX memory.
> +		 */
> +		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
> +		if (start_pfn >= end_pfn)
> +			continue;
> +
> +		/*
> +		 * Add the memory regions as TDX memory.  The regions in
> +		 * memblock has already guaranteed they are in address
> +		 * ascending order and don't overlap.
> +		 */
> +		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> +		if (ret)
> +			goto err;
> +	}

So at the time init_tdx_module() is called, you simply go over all 
memblocks.

But how can you be sure that they are TDX-capable?

While the memory notifier will deny onlining new memory blocks, 
add_memory() already happened and added a new memory block to the system 
(and to memblock). See add_memory_resource().

It might be cleaner to build the list once during module init (before 
any memory hotplug can happen and before we tear down memblock) and not 
require ARCH_KEEP_MEMBLOCK. Essentially, before registering the 
notifier. So the list is really static.

But maybe I am missing something.
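
To illustrate (completely untested; tdx_init() and tdx_memory_nb are
the names I assume this series uses):

	static int __init tdx_init(void)
	{
		int err;

		/*
		 * No memory hotplug has happened yet and memblock is
		 * still intact, so the list built here is final.
		 */
		err = build_tdx_memlist(&tdx_memlist);
		if (err)
			return err;

		return register_memory_notifier(&tdx_memory_nb);
	}

and the notifier can then reject anything that is not on the list.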

> +
> +	return 0;
> +err:
> +	free_tdx_memlist(tmb_list);
> +	return ret;
> +}
> +
>   static int init_tdx_module(void)
>   {
>   	struct tdsysinfo_struct *sysinfo;
> @@ -230,10 +313,25 @@ static int init_tdx_module(void)
>   	if (ret)
>   		goto out;

[...]

>   
> +struct tdx_memblock {
> +	struct list_head list;
> +	unsigned long start_pfn;
> +	unsigned long end_pfn;
> +};

If it's never consumed by someone else, maybe keep it local to the c file?

> +
>   struct tdx_module_output;
>   u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>   	       struct tdx_module_output *out);

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2023-06-26 14:12 ` [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
  2023-06-27  9:51   ` kirill.shutemov
  2023-07-04  7:40   ` Yuan Yao
@ 2023-07-11 11:42   ` David Hildenbrand
  2023-07-11 11:49     ` Huang, Kai
  2 siblings, 1 reply; 159+ messages in thread
From: David Hildenbrand @ 2023-07-11 11:42 UTC (permalink / raw)
  To: Kai Huang, linux-kernel, kvm
  Cc: linux-mm, x86, dave.hansen, kirill.shutemov, tony.luck, peterz,
	tglx, bp, mingo, hpa, seanjc, pbonzini, dan.j.williams,
	rafael.j.wysocki, ashok.raj, reinette.chatre, len.brown, ak,
	isaku.yamahata, ying.huang, chao.gao, sathyanarayanan.kuppuswamy,
	nik.borisov, bagasdotme, sagis, imammedo

On 26.06.23 16:12, Kai Huang wrote:
> The TDX module uses additional metadata to record things like which
> guest "owns" a given page of memory.  This metadata, referred as
> Physical Address Metadata Table (PAMT), essentially serves as the
> 'struct page' for the TDX module.  PAMTs are not reserved by hardware
> up front.  They must be allocated by the kernel and then given to the
> TDX module during module initialization.
> 
> TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
> (TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
> be a physically contiguous area from a Convertible Memory Region (CMR).
> However, the PAMTs which track pages in one TDMR do not need to reside
> within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
> any TDMR, the overlapping part must be reported as a reserved area in
> that particular TDMR.
> 
> Use alloc_contig_pages() since PAMT must be a physically contiguous area
> and it may be potentially large (~1/256th of the size of the given TDMR).
> The downside is alloc_contig_pages() may fail at runtime.  One (bad)
> mitigation is to launch a TDX guest early during system boot to get
> those PAMTs allocated early, but the only real fix is to add a boot
> option to allocate or reserve PAMTs during kernel boot.
> 
> It is imperfect but will be improved on later.
> 
> TDX only supports a limited number of reserved areas per TDMR to cover
> both PAMTs and memory holes within the given TDMR.  If many PAMTs are
> allocated within a single TDMR, the reserved areas may not be sufficient
> to cover all of them.
> 
> Adopt the following policies when allocating PAMTs for a given TDMR:
> 
>    - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
>      the total number of reserved areas consumed for PAMTs.
>    - Try to first allocate PAMT from the local node of the TDMR for better
>      NUMA locality.
> 
> Also dump out how many pages are allocated for PAMTs when the TDX module
> is initialized successfully.  This helps answer the eternal "where did
> all my memory go?" questions.
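
Just to make the allocation policy above concrete, I read it as
roughly the below (untested sketch, not the literal patch code;
tdmr_get_nid() is the helper the changelog mentions):

	static unsigned long tdmr_alloc_pamt(unsigned long pamt_size, int nid)
	{
		struct page *page;

		/*
		 * One contiguous chunk covering all three (4K/2M/1G)
		 * PAMTs, preferring the TDMR's local node @nid but
		 * falling back to any online node.
		 */
		page = alloc_contig_pages(pamt_size >> PAGE_SHIFT,
					  GFP_KERNEL, nid,
					  &node_online_map);

		return page ? page_to_phys(page) : 0;
	}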
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
> v11 -> v12:
>   - Moved TDX_PS_NUM from tdx.c to <asm/tdx.h> (Kirill)
>   - "<= TDX_PS_1G" -> "< TDX_PS_NUM" (Kirill)
>   - Changed tdmr_get_pamt() to return base and size instead of base_pfn
>     and npages and related code directly (Dave).
>   - Simplified PAMT kb counting. (Dave)
>   - tdmrs_count_pamt_pages() -> tdmr_count_pamt_kb() (Kirill/Dave)
> 
> v10 -> v11:
>   - No update
> 
> v9 -> v10:
>   - Removed code change in disable_tdx_module() as it doesn't exist
>     anymore.
> 
> v8 -> v9:
>   - Added TDX_PS_NR macro instead of open-coding (Dave).
>   - Better alignment of 'pamt_entry_size' in tdmr_set_up_pamt() (Dave).
>   - Changed to print out PAMTs in "KBs" instead of "pages" (Dave).
>   - Added Dave's Reviewed-by.
> 
> v7 -> v8: (Dave)
>   - Changelog:
>    - Added a sentence to state PAMT allocation will be improved.
>    - Others suggested by Dave.
>   - Moved 'nid' of 'struct tdx_memblock' to this patch.
>   - Improved comments around tdmr_get_nid().
>   - WARN_ON_ONCE() -> pr_warn() in tdmr_get_nid().
>   - Other changes due to 'struct tdmr_info_list'.
> 
> v6 -> v7:
>   - Changes due to using macros instead of 'enum' for TDX supported page
>     sizes.
> 
> v5 -> v6:
>   - Rebase due to using 'tdx_memblock' instead of memblock.
>   - 'int pamt_entry_nr' -> 'unsigned long nr_pamt_entries' (Dave/Sagis).
>   - Improved comment around tdmr_get_nid() (Dave).
>   - Improved comment in tdmr_set_up_pamt() around breaking the PAMT
>     into PAMTs for 4K/2M/1G (Dave).
>   - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave).
> 
> - v3 -> v5 (no feedback on v4):
>   - Used memblock to get the NUMA node for given TDMR.
>   - Removed tdmr_get_pamt_sz() helper but use open-code instead.
>   - Changed to use 'switch .. case..' for each TDX supported page size in
>     tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()).
>   - Added printing out memory used for PAMT allocation when TDX module is
>     initialized successfully.
>   - Explained downside of alloc_contig_pages() in changelog.
>   - Addressed other minor comments.
> 
> 
> ---
>   arch/x86/Kconfig            |   1 +
>   arch/x86/include/asm/tdx.h  |   1 +
>   arch/x86/virt/vmx/tdx/tdx.c | 215 +++++++++++++++++++++++++++++++++++-
>   arch/x86/virt/vmx/tdx/tdx.h |   1 +
>   4 files changed, 213 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2226d8a4c749..ad364f01de33 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
>   	depends on KVM_INTEL
>   	depends on X86_X2APIC
>   	select ARCH_KEEP_MEMBLOCK
> +	depends on CONTIG_ALLOC
>   	help
>   	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>   	  host and certain physical attacks.  This option enables necessary TDX
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index d8226a50c58c..91416fd600cd 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -24,6 +24,7 @@
>   #define TDX_PS_4K	0
>   #define TDX_PS_2M	1
>   #define TDX_PS_1G	2
> +#define TDX_PS_NR	(TDX_PS_1G + 1)
>   
>   /*
>    * Used to gather the output registers values of the TDCALL and SEAMCALL
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 2ffc1517a93b..fd5417577f26 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -221,7 +221,7 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
>    * overlap.
>    */
>   static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> -			    unsigned long end_pfn)
> +			    unsigned long end_pfn, int nid)
>   {
>   	struct tdx_memblock *tmb;
>   
> @@ -232,6 +232,7 @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
>   	INIT_LIST_HEAD(&tmb->list);
>   	tmb->start_pfn = start_pfn;
>   	tmb->end_pfn = end_pfn;
> +	tmb->nid = nid;
>   
>   	/* @tmb_list is protected by mem_hotplug_lock */
>   	list_add_tail(&tmb->list, tmb_list);
> @@ -259,9 +260,9 @@ static void free_tdx_memlist(struct list_head *tmb_list)
>   static int build_tdx_memlist(struct list_head *tmb_list)
>   {
>   	unsigned long start_pfn, end_pfn;
> -	int i, ret;
> +	int i, nid, ret;
>   
> -	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
>   		/*
>   		 * The first 1MB is not reported as TDX convertible memory.
>   		 * Although the first 1MB is always reserved and won't end up
> @@ -277,7 +278,7 @@ static int build_tdx_memlist(struct list_head *tmb_list)
>   		 * memblock has already guaranteed they are in address
>   		 * ascending order and don't overlap.
>   		 */
> -		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> +		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid);
>   		if (ret)
>   			goto err;

Why did you decide to defer remembering the nid as well? I'd just move 
that part to the patch that adds add_tdx_memblock().

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2023-07-11 11:42   ` David Hildenbrand
@ 2023-07-11 11:49     ` Huang, Kai
  2023-07-11 11:55       ` David Hildenbrand
  0 siblings, 1 reply; 159+ messages in thread
From: Huang, Kai @ 2023-07-11 11:49 UTC (permalink / raw)
  To: kvm, linux-kernel, david
  Cc: Raj, Ashok, Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki,
	Rafael J, kirill.shutemov, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, linux-mm, tglx, Yamahata, Isaku, mingo,
	nik.borisov, hpa, peterz, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Tue, 2023-07-11 at 13:42 +0200, David Hildenbrand wrote:
> > -		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> > +		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid);
> >    		if (ret)
> >    			goto err;
> 
> Why did you decide to defer remembering the nid as well? I'd just move 
> that part to the patch that adds add_tdx_memblock().

Thanks for the review.

The @nid is used to try to allocate the PAMT from the local node.  It only gets
used in this patch.  Originally (in v7) I had it in patch 09 but Dave suggested
moving it to this patch (see the first comment in the link below):

https://lore.kernel.org/lkml/8e6803f5-bec6-843d-f3c4-75006ffd0d2f@intel.com/

^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  2023-07-11 11:49     ` Huang, Kai
@ 2023-07-11 11:55       ` David Hildenbrand
  0 siblings, 0 replies; 159+ messages in thread
From: David Hildenbrand @ 2023-07-11 11:55 UTC (permalink / raw)
  To: Huang, Kai, kvm, linux-kernel
  Cc: Raj, Ashok, Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki,
	Rafael J, kirill.shutemov, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, linux-mm, tglx, Yamahata, Isaku, mingo,
	nik.borisov, hpa, peterz, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On 11.07.23 13:49, Huang, Kai wrote:
> On Tue, 2023-07-11 at 13:42 +0200, David Hildenbrand wrote:
>>> -		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
>>> +		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid);
>>>     		if (ret)
>>>     			goto err;
>>
>> Why did you decide to defer remembering the nid as well? I'd just move
>> that part to the patch that adds add_tdx_memblock().
> 
> Thanks for the review.
> 
> The @nid is used to try to allocate the PAMT from local node.  It only gets used
> in this patch.  Originally (in v7) I had it in patch 09 but Dave suggested to
> move to this patch (see the first comment in below link):
> 
> https://lore.kernel.org/lkml/8e6803f5-bec6-843d-f3c4-75006ffd0d2f@intel.com/

Okay, thanks.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 159+ messages in thread

* Re: [PATCH v12 09/22] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
  2023-07-11 11:38   ` David Hildenbrand
@ 2023-07-11 12:27     ` Huang, Kai
  0 siblings, 0 replies; 159+ messages in thread
From: Huang, Kai @ 2023-07-11 12:27 UTC (permalink / raw)
  To: kvm, linux-kernel, david
  Cc: Raj, Ashok, Hansen, Dave, Luck, Tony, bagasdotme, ak, Wysocki,
	Rafael J, kirill.shutemov, Chatre, Reinette, Christopherson,,
	Sean, pbonzini, linux-mm, tglx, Yamahata, Isaku, mingo,
	nik.borisov, hpa, peterz, Shahar, Sagi, imammedo, bp, Gao, Chao,
	Brown, Len, sathyanarayanan.kuppuswamy, Huang, Ying, Williams,
	Dan J, x86

On Tue, 2023-07-11 at 13:38 +0200, David Hildenbrand wrote:
> 
> [...]
> 
> > +/* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
> > +static LIST_HEAD(tdx_memlist);
> > +
> >   /*
> >    * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> >    * to kernel error code.  @seamcall_ret and @out contain the SEAMCALL
> > @@ -204,6 +214,79 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
> >   	return 0;
> >   }
> >   
> > +/*
> > + * Add a memory region as a TDX memory block.  The caller must make sure
> > + * all memory regions are added in address ascending order and don't
> > + * overlap.
> > + */
> > +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
> > +			    unsigned long end_pfn)
> > +{
> > +	struct tdx_memblock *tmb;
> > +
> > +	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> > +	if (!tmb)
> > +		return -ENOMEM;
> > +
> > +	INIT_LIST_HEAD(&tmb->list);
> > +	tmb->start_pfn = start_pfn;
> > +	tmb->end_pfn = end_pfn;
> > +
> > +	/* @tmb_list is protected by mem_hotplug_lock */
> 
> If the list is static and independent of memory hotplug, why does it 
> have to be protected?

Thanks for review!

The @tdx_memlist itself is a static variable, but the elements in the list are
built during module initialization, so we need to protect the list from the
memory hotplug code path.
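
I.e. readers of the list do something like (illustration only):

	struct tdx_memblock *tmb;

	get_online_mems();
	list_for_each_entry(tmb, &tdx_memlist, list)
		/* consume tmb->start_pfn / tmb->end_pfn */;
	put_online_mems();

so they are serialized against the memory hotplug paths, which run
with the same lock held.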

> 
> I assume because the memory notifier might currently trigger before 
> building the list.
> 
> Not sure if that is the right approach. See below.
> 
> > +	list_add_tail(&tmb->list, tmb_list);
> > +	return 0;
> > +}
> > +
> > +static void free_tdx_memlist(struct list_head *tmb_list)
> > +{
> > +	/* @tmb_list is protected by mem_hotplug_lock */
> > +	while (!list_empty(tmb_list)) {
> > +		struct tdx_memblock *tmb = list_first_entry(tmb_list,
> > +				struct tdx_memblock, list);
> > +
> > +		list_del(&tmb->list);
> > +		kfree(tmb);
> > +	}
> > +}
> > +
> > +/*
> > + * Ensure that all memblock memory regions are convertible to TDX
> > + * memory.  Once this has been established, stash the memblock
> > + * ranges off in a secondary structure because memblock is modified
> > + * in memory hotplug while TDX memory regions are fixed.
> > + */
> > +static int build_tdx_memlist(struct list_head *tmb_list)
> > +{
> > +	unsigned long start_pfn, end_pfn;
> > +	int i, ret;
> > +
> > +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> > +		/*
> > +		 * The first 1MB is not reported as TDX convertible memory.
> > +		 * Although the first 1MB is always reserved and won't end up
> > +		 * to the page allocator, it is still in memblock's memory
> > +		 * regions.  Skip them manually to exclude them as TDX memory.
> > +		 */
> > +		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
> > +		if (start_pfn >= end_pfn)
> > +			continue;
> > +
> > +		/*
> > +		 * Add the memory regions as TDX memory.  The regions in
> > +		 * memblock has already guaranteed they are in address
> > +		 * ascending order and don't overlap.
> > +		 */
> > +		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
> > +		if (ret)
> > +			goto err;
> > +	}
> 
> So at the time init_tdx_module() is called, you simply go over all 
> memblocks.
> 
> But how can you be sure that they are TDX-capable?

If any memory isn't TDX-capable, the later SEAMCALL TDH.SYS.CONFIG will fail.
There's no explicit check here to see whether all memblocks are within CMRs; we
depend on TDH.SYS.CONFIG to do that.  This is mainly for code simplicity.
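
An explicit check would look something like the below (untested
sketch, assuming we kept the CMR array returned by TDH.SYS.INFO
around):

	static bool tmb_within_cmrs(struct cmr_info *cmrs, int nr_cmrs,
				    struct tdx_memblock *tmb)
	{
		u64 start = PFN_PHYS(tmb->start_pfn);
		u64 end = PFN_PHYS(tmb->end_pfn);
		int i;

		for (i = 0; i < nr_cmrs; i++) {
			if (start >= cmrs[i].base &&
			    end <= cmrs[i].base + cmrs[i].size)
				return true;
		}
		return false;
	}

but since TDH.SYS.CONFIG already rejects non-CMR memory, the extra
code doesn't buy much.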

> 
> While the memory notifier will deny onlining new memory blocks, 
> add_memory() already happened and added a new memory block to the system 
> (and to memblock). See add_memory_resource().

Yes but this is fine, as long as they are not "plugged" into the buddy system.

> 
> It might be cleaner to build the list once during module init (before 
> any memory hotplug can happen and before we tear down memblock) and not 
> require ARCH_KEEP_MEMBLOCK. Essentially, before registering the 
> notifier. So the list is really static.

This can be another solution.  In fact I tried this before.  But one problem is
that by the time TDX module initialization happens, some memory may already have
been hot-added and/or become online.  So during module initialization, we cannot
simply pass the TDX memblocks built during kernel boot to the TDX module, but
need to verify that the current memblocks (this still requires
ARCH_KEEP_MEMBLOCK) or the online memory_blocks don't contain any memory that
isn't in the TDX memblocks.  To me this approach isn't simpler than the current
approach.
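
For completeness, the verification I mean would be something like
the below (rough idea only, untested; tdx_memlist_covers() is a
made-up helper that walks @tdx_memlist):

	static int verify_tdx_memlist(void)
	{
		unsigned long start_pfn, end_pfn;
		int i;

		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn,
				       &end_pfn, NULL) {
			start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
			if (start_pfn >= end_pfn)
				continue;

			/* fail if any current memblock isn't covered */
			if (!tdx_memlist_covers(start_pfn, end_pfn))
				return -EINVAL;
		}

		return 0;
	}

and doing this at module init still requires ARCH_KEEP_MEMBLOCK,
which is why it didn't end up being simpler.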

> 
> But maybe I am missing something.
> 
> > +
> > +	return 0;
> > +err:
> > +	free_tdx_memlist(tmb_list);
> > +	return ret;
> > +}
> > +
> >   static int init_tdx_module(void)
> >   {
> >   	struct tdsysinfo_struct *sysinfo;
> > @@ -230,10 +313,25 @@ static int init_tdx_module(void)
> >   	if (ret)
> >   		goto out;
> 
> [...]
> 
> >   
> > +struct tdx_memblock {
> > +	struct list_head list;
> > +	unsigned long start_pfn;
> > +	unsigned long end_pfn;
> > +};
> 
> If it's never consumed by someone else, maybe keep it local to the c file?

We can, and I actually did this in older versions, but I changed to put it here
because another structure, 'struct tdmr_info_list', is added in a later patch.
Also, if we move this structure to the .c file, then we should move all
kernel-defined structure/type declarations to the .c file too (the
architecture-defined structures I want to keep in tdx.h, as they are lengthy and
can be used by KVM in the future).  I found that harder to read as well.

But I am fine with either way.

Kirill/Dave, do you have any comments?

> 
> > +
> >   struct tdx_module_output;
> >   u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> >   	       struct tdx_module_output *out);
> 


^ permalink raw reply	[flat|nested] 159+ messages in thread

end of thread, other threads:[~2023-07-11 12:31 UTC | newest]

Thread overview: 159+ messages
2023-06-26 14:12 [PATCH v12 00/22] TDX host kernel support Kai Huang
2023-06-26 14:12 ` [PATCH v12 01/22] x86/tdx: Define TDX supported page sizes as macros Kai Huang
2023-06-26 14:12 ` [PATCH v12 02/22] x86/virt/tdx: Detect TDX during kernel boot Kai Huang
2023-06-26 14:12 ` [PATCH v12 03/22] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC Kai Huang
2023-06-26 14:12 ` [PATCH v12 04/22] x86/cpu: Detect TDX partial write machine check erratum Kai Huang
2023-06-29 11:22   ` David Hildenbrand
2023-06-26 14:12 ` [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure Kai Huang
2023-06-27  9:48   ` kirill.shutemov
2023-06-27 10:28     ` Huang, Kai
2023-06-27 11:36       ` kirill.shutemov
2023-06-28  0:19       ` Isaku Yamahata
2023-06-28  3:09   ` Chao Gao
2023-06-28  3:34     ` Huang, Kai
2023-06-28 11:50       ` kirill.shutemov
2023-06-28 23:31         ` Huang, Kai
2023-06-29 11:25       ` David Hildenbrand
2023-06-28 12:58   ` Peter Zijlstra
2023-06-28 13:54     ` Peter Zijlstra
2023-06-28 23:25       ` Huang, Kai
2023-06-29 10:15       ` kirill.shutemov
2023-06-28 23:21     ` Huang, Kai
2023-06-29  3:40       ` Huang, Kai
2023-06-26 14:12 ` [PATCH v12 06/22] x86/virt/tdx: Handle SEAMCALL running out of entropy error Kai Huang
2023-06-28 13:02   ` Peter Zijlstra
2023-06-28 23:30     ` Huang, Kai
2023-06-26 14:12 ` [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand Kai Huang
2023-06-26 21:21   ` Sathyanarayanan Kuppuswamy
2023-06-27 10:37     ` Huang, Kai
2023-06-27  9:50   ` kirill.shutemov
2023-06-27 10:34     ` Huang, Kai
2023-06-27 12:18       ` kirill.shutemov
2023-06-27 22:37         ` Huang, Kai
2023-06-28  0:28           ` Huang, Kai
2023-06-28 11:55             ` kirill.shutemov
2023-06-28 13:35             ` Peter Zijlstra
2023-06-29  0:15               ` Huang, Kai
2023-06-30  9:22                 ` Peter Zijlstra
2023-06-30 10:09                   ` Huang, Kai
2023-06-30 18:42                     ` Isaku Yamahata
2023-07-01  8:15                     ` Huang, Kai
2023-06-28  0:31           ` Isaku Yamahata
2023-06-28 13:04   ` Peter Zijlstra
2023-06-29  0:00     ` Huang, Kai
2023-06-30  9:25       ` Peter Zijlstra
2023-06-30  9:48         ` Huang, Kai
2023-06-28 13:08   ` Peter Zijlstra
2023-06-29  0:08     ` Huang, Kai
2023-06-28 13:17   ` Peter Zijlstra
2023-06-29  0:10     ` Huang, Kai
2023-06-30  9:26       ` Peter Zijlstra
2023-06-30  9:55         ` Huang, Kai
2023-06-30 18:30           ` Peter Zijlstra
2023-06-30 19:05             ` Isaku Yamahata
2023-06-30 21:24               ` Sean Christopherson
2023-06-30 21:58                 ` Dan Williams
2023-06-30 23:13                 ` Dave Hansen
2023-07-03 10:38                   ` Peter Zijlstra
2023-07-03 10:49                 ` Peter Zijlstra
2023-07-03 14:40                   ` Dave Hansen
2023-07-03 15:03                     ` Peter Zijlstra
2023-07-03 15:26                       ` Dave Hansen
2023-07-03 17:55                       ` kirill.shutemov
2023-07-03 18:26                         ` Dave Hansen
2023-07-05  7:14                         ` Peter Zijlstra
2023-07-04 16:58                 ` Peter Zijlstra
2023-07-04 21:50                   ` Huang, Kai
2023-07-05  7:16                     ` Peter Zijlstra
2023-07-05  7:54                       ` Huang, Kai
2023-07-05 14:34                   ` Dave Hansen
2023-07-05 14:57                     ` Peter Zijlstra
2023-07-06 14:49                       ` Dave Hansen
2023-07-10 17:58                         ` Sean Christopherson
2023-06-29 11:31   ` David Hildenbrand
2023-06-29 22:58     ` Huang, Kai
2023-06-26 14:12 ` [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory Kai Huang
2023-06-27  9:51   ` kirill.shutemov
2023-06-27 10:45     ` Huang, Kai
2023-06-27 11:37       ` kirill.shutemov
2023-06-27 11:46         ` Huang, Kai
2023-06-28 14:10   ` Peter Zijlstra
2023-06-29  9:15     ` Huang, Kai
2023-06-30  9:34       ` Peter Zijlstra
2023-06-30  9:58         ` Huang, Kai
2023-06-26 14:12 ` [PATCH v12 09/22] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Kai Huang
2023-06-28 14:17   ` Peter Zijlstra
2023-06-29  0:57     ` Huang, Kai
2023-07-11 11:38   ` David Hildenbrand
2023-07-11 12:27     ` Huang, Kai
2023-06-26 14:12 ` [PATCH v12 10/22] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Kai Huang
2023-06-26 14:12 ` [PATCH v12 11/22] x86/virt/tdx: Fill out " Kai Huang
2023-07-04  7:28   ` Yuan Yao
2023-06-26 14:12 ` [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Kai Huang
2023-06-27  9:51   ` kirill.shutemov
2023-07-04  7:40   ` Yuan Yao
2023-07-04  8:59     ` Huang, Kai
2023-07-11 11:42   ` David Hildenbrand
2023-07-11 11:49     ` Huang, Kai
2023-07-11 11:55       ` David Hildenbrand
2023-06-26 14:12 ` [PATCH v12 13/22] x86/virt/tdx: Designate reserved areas for all TDMRs Kai Huang
2023-07-05  5:29   ` Yuan Yao
2023-06-26 14:12 ` [PATCH v12 14/22] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID Kai Huang
2023-07-05  6:49   ` Yuan Yao
2023-06-26 14:12 ` [PATCH v12 15/22] x86/virt/tdx: Configure global KeyID on all packages Kai Huang
2023-07-05  8:13   ` Yuan Yao
2023-06-26 14:12 ` [PATCH v12 16/22] x86/virt/tdx: Initialize all TDMRs Kai Huang
2023-07-06  5:31   ` Yuan Yao
2023-06-26 14:12 ` [PATCH v12 17/22] x86/kexec: Flush cache of TDX private memory Kai Huang
2023-06-26 14:12 ` [PATCH v12 18/22] x86/virt/tdx: Keep TDMRs when module initialization is successful Kai Huang
2023-06-28  9:04   ` Nikolay Borisov
2023-06-29  1:03     ` Huang, Kai
2023-06-28 12:23   ` kirill.shutemov
2023-06-28 12:48     ` Nikolay Borisov
2023-06-29  0:24       ` Huang, Kai
2023-06-26 14:12 ` [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum Kai Huang
2023-06-28  9:20   ` Nikolay Borisov
2023-06-29  0:32     ` Dave Hansen
2023-06-29  0:58       ` Huang, Kai
2023-06-29  3:19     ` Huang, Kai
2023-06-29  5:38       ` Huang, Kai
2023-06-29  9:45         ` Huang, Kai
2023-06-29  9:48           ` Nikolay Borisov
2023-06-28 12:29   ` kirill.shutemov
2023-06-29  0:27     ` Huang, Kai
2023-07-07  4:01   ` Yuan Yao
2023-06-26 14:12 ` [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP Kai Huang
2023-06-28 12:32   ` kirill.shutemov
2023-06-28 15:29   ` Peter Zijlstra
2023-06-28 20:38     ` Peter Zijlstra
2023-06-28 21:11       ` Peter Zijlstra
2023-06-28 21:16         ` Peter Zijlstra
2023-06-30  9:03           ` kirill.shutemov
2023-06-30 10:02             ` Huang, Kai
2023-06-30 10:22               ` kirill.shutemov
2023-06-30 11:06                 ` Huang, Kai
2023-06-29 10:33       ` Huang, Kai
2023-06-30 10:06         ` Peter Zijlstra
2023-06-30 10:18           ` Huang, Kai
2023-06-30 15:16             ` Dave Hansen
2023-07-01  8:16               ` Huang, Kai
2023-06-30 10:21           ` Peter Zijlstra
2023-06-30 11:05             ` Huang, Kai
2023-06-30 12:06             ` Peter Zijlstra
2023-06-30 15:14               ` Peter Zijlstra
2023-07-03 12:15               ` Huang, Kai
2023-07-05 10:21                 ` Peter Zijlstra
2023-07-05 11:34                   ` Huang, Kai
2023-07-05 12:19                     ` Peter Zijlstra
2023-07-05 12:53                       ` Huang, Kai
2023-07-05 20:56                         ` Isaku Yamahata
2023-07-05 12:21                     ` Peter Zijlstra
2023-06-29 11:16       ` kirill.shutemov
2023-06-29 10:00     ` Huang, Kai
2023-06-26 14:12 ` [PATCH v12 21/22] x86/mce: Improve error log of kernel space TDX #MC due to erratum Kai Huang
2023-06-28 12:38   ` kirill.shutemov
2023-07-07  7:26   ` Yuan Yao
2023-06-26 14:12 ` [PATCH v12 22/22] Documentation/x86: Add documentation for TDX host support Kai Huang
2023-06-28  7:04 ` [PATCH v12 00/22] TDX host kernel support Yuan Yao
2023-06-28  8:12   ` Huang, Kai
2023-06-29  1:01     ` Yuan Yao
