From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: linux-mm@kvack.org, dave.hansen@intel.com,
	kirill.shutemov@linux.intel.com, tony.luck@intel.com,
	peterz@infradead.org, tglx@linutronix.de, seanjc@google.com,
	pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com,
	rafael.j.wysocki@intel.com, ying.huang@intel.com,
	reinette.chatre@intel.com, len.brown@intel.com,
	ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com,
	sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com,
	sagis@google.com, imammedo@redhat.com, kai.huang@intel.com
Subject: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot
Date: Mon,  5 Jun 2023 02:27:31 +1200
Message-ID: <5aa7506d4fedbf625e3fe8ceeb88af3be1ce97ea.1685887183.git.kai.huang@intel.com>
In-Reply-To: <cover.1685887183.git.kai.huang@intel.com>

The first few generations of TDX hardware have an erratum.  A partial
write to a TDX private memory cacheline will silently "poison" the
line.  Subsequent reads will consume the poison and generate a machine
check.  According to the TDX hardware spec, neither of these things
should have happened.

== Background ==

Virtually all kernel memory access operations happen in full
cachelines.  In practice, writing a "byte" of memory usually reads a
64-byte cacheline of memory, modifies it, then writes the whole line
back.  Those operations do not trigger this problem.

This problem is triggered by "partial" writes, where a write transaction
of less than a cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
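
As a rough illustration of what "partial" means here, the sketch below
classifies a write transaction by whether it covers only whole 64-byte
cachelines.  The helper name is_partial_cacheline_write() is hypothetical
and exists only for this example; it is not part of this patch or of the
kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64u	/* cacheline size assumed in the text above */

/*
 * Hypothetical helper, for illustration only: return true if a write
 * transaction of @len bytes starting at @addr covers at least one
 * cacheline only partially, i.e. it is not made up of whole, aligned
 * 64-byte lines.
 */
static bool is_partial_cacheline_write(uint64_t addr, size_t len)
{
	if (len == 0)
		return false;

	/* An unaligned start or a ragged length leaves a line partly written. */
	return (addr % CACHELINE_SIZE) != 0 || (len % CACHELINE_SIZE) != 0;
}
```

With this model, an 8-byte write at a line-aligned address is partial,
while an aligned 64-byte write is not.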

== Problem ==

A fast warm reset doesn't reset TDX private memory, and kexec() boots
into the new kernel directly without any reset.  Thus if the old kernel
has enabled TDX on a platform with this erratum, the new kernel may get
an unexpected machine check.

Note that without this erratum, kernel reads and writes of TDX private
memory should never cause a machine check, thus it's OK for the old
kernel to leave TDX private pages as is.

== Solution ==

In short, with this erratum, the kernel needs to explicitly convert all
TDX private pages back to normal to give the new kernel a clean slate
after either a fast warm reset or kexec().

There's no existing infrastructure to track TDX private pages, which
could be PAMT pages, TDX guest private pages, or SEPT (secure EPT)
pages.  The latter two are not implemented yet, so it's not yet clear
how to track them.

It's not feasible to query the TDX module either because VMX has already
been stopped when KVM receives the reboot notifier.

Another option is to blindly convert all memory pages.  But this may
add non-trivial latency to machine reboot and kexec() on large-memory
systems (especially when the number of TDX private pages is small).  The
final solution should be to track TDX private pages and only convert
them.  Converting all memory pages is also problematic because not all
pages are mapped as writable in the direct mapping.  Doing so would
require switching to a page table which maps all pages as writable.
Such a page table can either be a new page table, or the identity
mapping table built during kexec().  Using either seems too dramatic,
especially considering the kernel should eventually be able to track
all TDX private pages, in which case the direct mapping can be used
directly.

So for now just convert PAMT pages.  Converting TDX guest private pages
and SEPT pages can be added when TDX guest support is added to the
kernel.
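
The PAMT-only conversion can be pictured with the user-space sketch
below.  It is only a model under stated assumptions: struct pamt_range,
reset_pamt_range() and reset_pamt_all() are hypothetical names, and a
plain 64-byte memcpy() of zeros stands in for the full-cacheline write
(the code comment in this patch mentions MOVDIR64B) that the real code
would need.  It is not the tdmrs_reset_pamt_all() implementation.

```c
#include <stddef.h>
#include <string.h>

#define CACHELINE_SIZE 64

/* Hypothetical stand-in for one TDMR's PAMT region. */
struct pamt_range {
	unsigned char *base;
	size_t size;		/* in bytes; 0 means "not set up yet" */
};

/* Overwrite one region a full cacheline at a time, never partially. */
static void reset_pamt_range(unsigned char *base, size_t size)
{
	static const unsigned char zero_line[CACHELINE_SIZE];
	size_t off;

	for (off = 0; off + CACHELINE_SIZE <= size; off += CACHELINE_SIZE)
		memcpy(base + off, zero_line, CACHELINE_SIZE);
}

/* Walk all regions, skip the ones not set up, return how many were reset. */
static size_t reset_pamt_all(struct pamt_range *ranges, size_t nr)
{
	size_t i, done = 0;

	for (i = 0; i < nr; i++) {
		if (!ranges[i].base || !ranges[i].size)
			continue;
		reset_pamt_range(ranges[i].base, ranges[i].size);
		done++;
	}
	return done;
}
```

The point of the model is that only the known PAMT regions are walked,
so the cost scales with the PAMT size rather than with total memory.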

Introduce a new "x86_platform_ops::memory_shutdown()" callback as a
placeholder to convert all TDX private memory, and call it at the end of
machine_shutdown() after all remote CPUs have been stopped (thus no more
TDX activities) and all dirty cachelines of TDX private memory have been
flushed (thus no later cacheline writeback).

Implement the default callback as a noop function.  During TDX early
boot-time initialization, replace the callback with TDX's own
implementation when the platform has this erratum.  In this way only
the platforms with this erratum carry this additional memory conversion
burden.
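
The callback arrangement can be modeled outside the kernel as in the
sketch below.  All names in it (struct platform_ops, platform_init(),
machine_shutdown_model(), nr_conversions) are hypothetical and chosen
for the example; only the shape mirrors this patch: a noop default that
is overridden once at init time when the erratum is present.

```c
#include <stdbool.h>

/* Hypothetical, minimal stand-in for x86_platform_ops. */
struct platform_ops {
	void (*memory_shutdown)(void);
};

static int nr_conversions;	/* counts calls to the TDX-style handler */

static void memory_shutdown_noop(void) { }

/* Stand-in for tdx_memory_shutdown(); the real one resets PAMT pages. */
static void tdx_memory_shutdown_model(void)
{
	nr_conversions++;
}

static struct platform_ops platform = {
	.memory_shutdown = memory_shutdown_noop,
};

/* Boot-time init: install the real handler only on affected platforms. */
static void platform_init(bool has_erratum)
{
	if (has_erratum)
		platform.memory_shutdown = tdx_memory_shutdown_model;
}

/* Shutdown path: the call is unconditional; the noop makes it free. */
static void machine_shutdown_model(void)
{
	platform.memory_shutdown();
}
```

Keeping the call site unconditional avoids an erratum check in the
shutdown path; unaffected platforms only pay for an empty function call.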

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v10 -> v11:
 - New patch

---
 arch/x86/include/asm/x86_init.h |  1 +
 arch/x86/kernel/reboot.c        |  1 +
 arch/x86/kernel/x86_init.c      |  2 ++
 arch/x86/virt/vmx/tdx/tdx.c     | 57 +++++++++++++++++++++++++++++++++
 4 files changed, 61 insertions(+)

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 88085f369ff6..d2c6742b185a 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -299,6 +299,7 @@ struct x86_platform_ops {
 	void (*get_wallclock)(struct timespec64 *ts);
 	int (*set_wallclock)(const struct timespec64 *ts);
 	void (*iommu_shutdown)(void);
+	void (*memory_shutdown)(void);
 	bool (*is_untracked_pat_range)(u64 start, u64 end);
 	void (*nmi_init)(void);
 	unsigned char (*get_nmi_reason)(void);
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index b3d0e015dae2..6aadfec8df7a 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -720,6 +720,7 @@ void native_machine_shutdown(void)
 
 #ifdef CONFIG_X86_64
 	x86_platform.iommu_shutdown();
+	x86_platform.memory_shutdown();
 #endif
 }
 
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index d82f4fa2f1bf..344250b35a5d 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -31,6 +31,7 @@ void x86_init_noop(void) { }
 void __init x86_init_uint_noop(unsigned int unused) { }
 static int __init iommu_init_noop(void) { return 0; }
 static void iommu_shutdown_noop(void) { }
+static void memory_shutdown_noop(void) { }
 bool __init bool_x86_init_noop(void) { return false; }
 void x86_op_int_noop(int cpu) { }
 int set_rtc_noop(const struct timespec64 *now) { return -EINVAL; }
@@ -142,6 +143,7 @@ struct x86_platform_ops x86_platform __ro_after_init = {
 	.get_wallclock			= mach_get_cmos_time,
 	.set_wallclock			= mach_set_cmos_time,
 	.iommu_shutdown			= iommu_shutdown_noop,
+	.memory_shutdown		= memory_shutdown_noop,
 	.is_untracked_pat_range		= is_ISA_range,
 	.nmi_init			= default_nmi_init,
 	.get_nmi_reason			= default_get_nmi_reason,
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 8ff07256a515..0aa413b712e8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
 		tdmr_pamt_base += pamt_size[pgsz];
 	}
 
+	/*
+	 * tdx_memory_shutdown() also reads the TDMR's PAMT during
+	 * kexec() or reboot, which could happen at any time, even
+	 * while this code is running.  Make sure pamt_4k_base is
+	 * set first, otherwise tdx_memory_shutdown() may get an
+	 * invalid PAMT base when it sees a valid number of PAMT
+	 * pages.
+	 */
 	tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
 	tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
 	tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
@@ -1318,6 +1326,46 @@ static struct notifier_block tdx_memory_nb = {
 	.notifier_call = tdx_memory_notifier,
 };
 
+static void tdx_memory_shutdown(void)
+{
+	/*
+	 * Convert all TDX private pages back to normal if the platform
+	 * has the "partial write machine check" erratum.
+	 *
+	 * For now there's no existing infrastructure to tell whether
+	 * a page is TDX private memory.  Using SEAMCALL to query the TDX
+	 * module isn't feasible either because: 1) VMX has been turned
+	 * off by the time we get here, so SEAMCALL cannot be made;
+	 * 2) even if SEAMCALL could be made, the result from the TDX
+	 * module may not be accurate (e.g., a remote CPU can be stopped
+	 * while the kernel is in the middle of reclaiming one TDX
+	 * private page and doing MOVDIR64B).
+	 *
+	 * One solution could be to just convert all memory pages, but
+	 * this may add non-trivial latency on large memory systems
+	 * (especially when the number of TDX private pages is small).
+	 * Eventually the kernel should track TDX private pages and
+	 * only convert those.
+	 *
+	 * Also, not all pages are mapped as writable in the direct
+	 * mapping, so doing so is problematic.  It could be done by
+	 * switching to the identity mapping page table built for kexec(),
+	 * which maps all pages as writable, but that looks like overkill.
+	 *
+	 * Thus instead of doing something dramatic to convert all pages,
+	 * only convert PAMTs for now, since TDX private pages can
+	 * currently only be PAMT.  Converting TDX guest private pages
+	 * and Secure EPT pages can be added later when the kernel has a
+	 * proper way to track these pages.
+	 *
+	 * All other CPUs are already dead, thus it's safe to read TDMRs
+	 * to find PAMTs without holding any lock here.
+	 */
+	WARN_ON_ONCE(num_online_cpus() != 1);
+
+	tdmrs_reset_pamt_all(&tdx_tdmr_list);
+}
+
 static int __init tdx_init(void)
 {
 	u32 tdx_keyid_start, nr_tdx_keyids;
@@ -1356,6 +1404,15 @@ static int __init tdx_init(void)
 	tdx_guest_keyid_start = ++tdx_keyid_start;
 	tdx_nr_guest_keyids = --nr_tdx_keyids;
 
+	/*
+	 * On platforms with the erratum, all TDX private pages need to
+	 * be converted back to normal before rebooting (warm reset) or
+	 * before kexec() boots the new kernel, otherwise the (new)
+	 * kernel may get an unexpected SRAR machine check exception.
+	 */
+	if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+		x86_platform.memory_shutdown = tdx_memory_shutdown;
+
 	return 0;
 no_tdx:
 	return -ENODEV;
-- 
2.40.1


